Problem Statement¶

Business Context¶

Renewable energy sources play an increasingly important role in the global energy mix, as the effort to reduce the environmental impact of energy production increases.

Out of all the renewable energy alternatives, wind energy is one of the most developed technologies worldwide. The U.S Department of Energy has put together a guide to achieving operational efficiency using predictive maintenance practices.

Predictive maintenance uses sensor information and analysis methods to measure and predict degradation and future component capability. The idea behind predictive maintenance is that failure patterns are predictable and if component failure can be predicted accurately and the component is replaced before it fails, the costs of operation and maintenance will be much lower.

The sensors fitted across different machines involved in the process of energy generation collect data related to various environmental factors (temperature, humidity, wind speed, etc.) and additional features related to various parts of the wind turbine (gearbox, tower, blades, break, etc.).

Objective¶

“ReneWind” is a company working on improving the machinery/processes involved in the production of wind energy using machine learning and has collected data of generator failure of wind turbines using sensors. They have shared a ciphered version of the data, as the data collected through sensors is confidential (the type of data collected varies with companies). Data has 40 predictors, 20000 observations in the training set and 5000 in the test set.

The objective is to build various classification models, tune them, and find the best one that will help identify failures so that the generators could be repaired before failing/breaking to reduce the overall maintenance cost. The nature of predictions made by the classification model will translate as follows:

  • True positives (TP) are failures correctly predicted by the model. These will result in repairing costs.
  • False negatives (FN) are real failures where there is no detection by the model. These will result in replacement costs.
  • False positives (FP) are detections where there is no failure. These will result in inspection costs.

It is given that the cost of repairing a generator is much less than the cost of replacing it, and the cost of inspection is less than the cost of repair.

“1” in the target variables should be considered as “failure” and “0” represents “No failure”.

Data Description¶

  • The data provided is a transformed version of original data which was collected using sensors.
  • Train.csv - To be used for training and tuning of models.
  • Test.csv - To be used only for testing the performance of the final best model.
  • Both the datasets consist of 40 predictor variables and 1 target variable

Importing necessary libraries¶

In [1]:
# Libraries to help with reading and manipulating data
import pandas as pd
import numpy as np

# Libaries to help with data visualization
import matplotlib.pyplot as plt
import seaborn as sns

# To tune model, get different metric scores, and split data
from sklearn.metrics import (
    f1_score,
    accuracy_score,
    recall_score,
    precision_score,
    confusion_matrix,
    roc_auc_score,
    confusion_matrix,
)
from sklearn import metrics

from sklearn.model_selection import train_test_split, StratifiedKFold, cross_val_score

# To be used for data scaling and one hot encoding
from sklearn.preprocessing import StandardScaler, MinMaxScaler, OneHotEncoder

# To impute missing values
from sklearn.impute import SimpleImputer

# To oversample and undersample data
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler

# To do hyperparameter tuning
from sklearn.model_selection import RandomizedSearchCV

# To be used for creating pipelines and personalizing them
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer

# To define maximum number of columns to be displayed in a dataframe
pd.set_option("display.max_columns", None)
pd.set_option("display.max_rows", None)

# To supress scientific notations for a dataframe
pd.set_option("display.float_format", lambda x: "%.3f" % x)

# To help with model building
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import (
    AdaBoostClassifier,
    GradientBoostingClassifier,
    RandomForestClassifier,
    BaggingClassifier,
)
from xgboost import XGBClassifier

# To suppress scientific notations
pd.set_option("display.float_format", lambda x: "%.3f" % x)

# To suppress warnings
import warnings

warnings.filterwarnings("ignore")

# This will help in making the Python code more structured automatically (good coding practice)
#%load_ext nb_black

Source: MT_Project_LearnerNotebook_LowCode.ipynb

Loading the dataset¶

In [2]:
from google.colab import drive
drive.mount('/content/drive')
wind=pd.read_csv('/content/drive/MyDrive/Train.csv.csv')
Mounted at /content/drive

Data Overview¶

  • Observations
  • Sanity checks
In [ ]:
wind.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20000 entries, 0 to 19999
Data columns (total 41 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   V1      19982 non-null  float64
 1   V2      19982 non-null  float64
 2   V3      20000 non-null  float64
 3   V4      20000 non-null  float64
 4   V5      20000 non-null  float64
 5   V6      20000 non-null  float64
 6   V7      20000 non-null  float64
 7   V8      20000 non-null  float64
 8   V9      20000 non-null  float64
 9   V10     20000 non-null  float64
 10  V11     20000 non-null  float64
 11  V12     20000 non-null  float64
 12  V13     20000 non-null  float64
 13  V14     20000 non-null  float64
 14  V15     20000 non-null  float64
 15  V16     20000 non-null  float64
 16  V17     20000 non-null  float64
 17  V18     20000 non-null  float64
 18  V19     20000 non-null  float64
 19  V20     20000 non-null  float64
 20  V21     20000 non-null  float64
 21  V22     20000 non-null  float64
 22  V23     20000 non-null  float64
 23  V24     20000 non-null  float64
 24  V25     20000 non-null  float64
 25  V26     20000 non-null  float64
 26  V27     20000 non-null  float64
 27  V28     20000 non-null  float64
 28  V29     20000 non-null  float64
 29  V30     20000 non-null  float64
 30  V31     20000 non-null  float64
 31  V32     20000 non-null  float64
 32  V33     20000 non-null  float64
 33  V34     20000 non-null  float64
 34  V35     20000 non-null  float64
 35  V36     20000 non-null  float64
 36  V37     20000 non-null  float64
 37  V38     20000 non-null  float64
 38  V39     20000 non-null  float64
 39  V40     20000 non-null  float64
 40  Target  20000 non-null  int64  
dtypes: float64(40), int64(1)
memory usage: 6.3 MB

40 variables with 18 values missing for v1 and v2. These must be given values in order to run the models. All values are type float except for Target, which is classified as a integer. All variables are given coded names. Therfore, recommendations must be given using codes alone.

In [ ]:
wind.shape
Out[ ]:
(20000, 41)

20,000 rows and 40 columns.

In [ ]:
wind.isnull().sum()
Out[ ]:
V1        18
V2        18
V3         0
V4         0
V5         0
V6         0
V7         0
V8         0
V9         0
V10        0
V11        0
V12        0
V13        0
V14        0
V15        0
V16        0
V17        0
V18        0
V19        0
V20        0
V21        0
V22        0
V23        0
V24        0
V25        0
V26        0
V27        0
V28        0
V29        0
V30        0
V31        0
V32        0
V33        0
V34        0
V35        0
V36        0
V37        0
V38        0
V39        0
V40        0
Target     0
dtype: int64

As stated earlier, there are 18 missing values in v1 and v2.

In [ ]:
wind.head(10)
Out[ ]:
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15 V16 V17 V18 V19 V20 V21 V22 V23 V24 V25 V26 V27 V28 V29 V30 V31 V32 V33 V34 V35 V36 V37 V38 V39 V40 Target
0 -4.465 -4.679 3.102 0.506 -0.221 -2.033 -2.911 0.051 -1.522 3.762 -5.715 0.736 0.981 1.418 -3.376 -3.047 0.306 2.914 2.270 4.395 -2.388 0.646 -1.191 3.133 0.665 -2.511 -0.037 0.726 -3.982 -1.073 1.667 3.060 -1.690 2.846 2.235 6.667 0.444 -2.369 2.951 -3.480 0
1 3.366 3.653 0.910 -1.368 0.332 2.359 0.733 -4.332 0.566 -0.101 1.914 -0.951 -1.255 -2.707 0.193 -4.769 -2.205 0.908 0.757 -5.834 -3.065 1.597 -1.757 1.766 -0.267 3.625 1.500 -0.586 0.783 -0.201 0.025 -1.795 3.033 -2.468 1.895 -2.298 -1.731 5.909 -0.386 0.616 0
2 -3.832 -5.824 0.634 -2.419 -1.774 1.017 -2.099 -3.173 -2.082 5.393 -0.771 1.107 1.144 0.943 -3.164 -4.248 -4.039 3.689 3.311 1.059 -2.143 1.650 -1.661 1.680 -0.451 -4.551 3.739 1.134 -2.034 0.841 -1.600 -0.257 0.804 4.086 2.292 5.361 0.352 2.940 3.839 -4.309 0
3 1.618 1.888 7.046 -1.147 0.083 -1.530 0.207 -2.494 0.345 2.119 -3.053 0.460 2.705 -0.636 -0.454 -3.174 -3.404 -1.282 1.582 -1.952 -3.517 -1.206 -5.628 -1.818 2.124 5.295 4.748 -2.309 -3.963 -6.029 4.949 -3.584 -2.577 1.364 0.623 5.550 -1.527 0.139 3.101 -1.277 0
4 -0.111 3.872 -3.758 -2.983 3.793 0.545 0.205 4.849 -1.855 -6.220 1.998 4.724 0.709 -1.989 -2.633 4.184 2.245 3.734 -6.313 -5.380 -0.887 2.062 9.446 4.490 -3.945 4.582 -8.780 -3.383 5.107 6.788 2.044 8.266 6.629 -10.069 1.223 -3.230 1.687 -2.164 -3.645 6.510 0
5 0.160 -4.234 -0.264 -5.477 -0.191 -0.356 -0.134 4.067 -3.859 1.692 0.138 3.975 0.673 1.878 0.764 4.236 -2.129 2.348 -2.147 -0.982 0.386 1.011 3.419 0.996 0.061 -3.037 1.788 -1.727 0.308 1.902 4.666 3.227 0.629 -1.549 1.322 5.461 1.109 -3.870 0.274 2.806 0
6 -0.185 -4.721 0.865 -3.079 -2.227 -1.282 -0.805 3.290 -1.568 0.750 0.529 3.221 2.945 1.724 -0.923 2.535 -1.697 0.677 -0.246 2.748 -1.165 0.248 1.161 -2.850 0.503 -3.532 1.861 -1.465 0.874 2.418 0.939 -0.545 -0.763 0.816 1.889 3.624 1.556 -5.433 0.679 0.465 0
7 1.735 1.683 -1.269 4.601 -1.417 -2.544 0.132 -0.199 3.094 -1.109 -1.662 0.944 3.481 0.137 -3.473 -4.076 1.727 -1.909 3.569 2.512 -4.579 3.063 3.686 0.611 -0.430 0.880 -0.994 1.134 -3.768 -0.692 -5.244 1.717 -3.839 1.569 1.795 -4.269 -0.516 -0.619 -0.831 -4.967 1
8 1.782 1.315 4.249 -0.518 -0.149 0.033 -1.088 -3.118 0.625 1.567 -0.415 -1.401 2.607 -1.024 -2.878 -4.524 -4.354 0.107 1.299 -3.596 -5.409 0.633 -3.043 0.965 -0.266 4.671 1.847 -2.321 -1.318 -0.682 3.281 1.611 2.951 -1.862 4.390 1.371 -2.516 0.770 0.831 -2.311 0
9 -0.894 4.011 5.252 3.321 0.727 -4.771 1.031 3.632 -1.391 -1.967 -4.779 6.617 -0.148 -2.513 0.734 0.475 5.085 -2.361 4.561 2.287 -2.307 -0.949 -0.301 2.546 0.738 4.266 -4.145 -0.013 -1.469 -2.003 1.680 -0.636 -4.449 2.296 1.575 1.376 0.597 -1.414 0.544 0.035 0

Values range from negative to positive numbers. Instructions for the project stated that the negative numbers need not be transformed.

In [ ]:
wind.describe()
Out[ ]:
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15 V16 V17 V18 V19 V20 V21 V22 V23 V24 V25 V26 V27 V28 V29 V30 V31 V32 V33 V34 V35 V36 V37 V38 V39 V40 Target
count 19982.000 19982.000 20000.000 20000.000 20000.000 20000.000 20000.000 20000.000 20000.000 20000.000 20000.000 20000.000 20000.000 20000.000 20000.000 20000.000 20000.000 20000.000 20000.000 20000.000 20000.000 20000.000 20000.000 20000.000 20000.000 20000.000 20000.000 20000.000 20000.000 20000.000 20000.000 20000.000 20000.000 20000.000 20000.000 20000.000 20000.000 20000.000 20000.000 20000.000 20000.000
mean -0.272 0.440 2.485 -0.083 -0.054 -0.995 -0.879 -0.548 -0.017 -0.013 -1.895 1.605 1.580 -0.951 -2.415 -2.925 -0.134 1.189 1.182 0.024 -3.611 0.952 -0.366 1.134 -0.002 1.874 -0.612 -0.883 -0.986 -0.016 0.487 0.304 0.050 -0.463 2.230 1.515 0.011 -0.344 0.891 -0.876 0.056
std 3.442 3.151 3.389 3.432 2.105 2.041 1.762 3.296 2.161 2.193 3.124 2.930 2.875 1.790 3.355 4.222 3.345 2.592 3.397 3.669 3.568 1.652 4.032 3.912 2.017 3.435 4.369 1.918 2.684 3.005 3.461 5.500 3.575 3.184 2.937 3.801 1.788 3.948 1.753 3.012 0.229
min -11.876 -12.320 -10.708 -15.082 -8.603 -10.227 -7.950 -15.658 -8.596 -9.854 -14.832 -12.948 -13.228 -7.739 -16.417 -20.374 -14.091 -11.644 -13.492 -13.923 -17.956 -10.122 -14.866 -16.387 -8.228 -11.834 -14.905 -9.269 -12.579 -14.796 -13.723 -19.877 -16.898 -17.985 -15.350 -14.833 -5.478 -17.375 -6.439 -11.024 0.000
25% -2.737 -1.641 0.207 -2.348 -1.536 -2.347 -2.031 -2.643 -1.495 -1.411 -3.922 -0.397 -0.224 -2.171 -4.415 -5.634 -2.216 -0.404 -1.050 -2.433 -5.930 -0.118 -3.099 -1.468 -1.365 -0.338 -3.652 -2.171 -2.787 -1.867 -1.818 -3.420 -2.243 -2.137 0.336 -0.944 -1.256 -2.988 -0.272 -2.940 0.000
50% -0.748 0.472 2.256 -0.135 -0.102 -1.001 -0.917 -0.389 -0.068 0.101 -1.921 1.508 1.637 -0.957 -2.383 -2.683 -0.015 0.883 1.279 0.033 -3.533 0.975 -0.262 0.969 0.025 1.951 -0.885 -0.891 -1.176 0.184 0.490 0.052 -0.066 -0.255 2.099 1.567 -0.128 -0.317 0.919 -0.921 0.000
75% 1.840 2.544 4.566 2.131 1.340 0.380 0.224 1.723 1.409 1.477 0.119 3.571 3.460 0.271 -0.359 -0.095 2.069 2.572 3.493 2.512 -1.266 2.026 2.452 3.546 1.397 4.130 2.189 0.376 0.630 2.036 2.731 3.762 2.255 1.437 4.064 3.984 1.176 2.279 2.058 1.120 0.000
max 15.493 13.089 17.091 13.236 8.134 6.976 8.006 11.679 8.138 8.108 11.826 15.081 15.420 5.671 12.246 13.583 16.756 13.180 13.238 16.052 13.840 7.410 14.459 17.163 8.223 16.836 17.560 6.528 10.722 12.506 17.255 23.633 16.692 14.358 15.291 19.330 7.467 15.290 7.760 10.654 1.000
In [ ]:
wind["Target"].value_counts(normalize=True)
Out[ ]:
0   0.945
1   0.056
Name: Target, dtype: float64

Observations

  1. 18 missing values in variables v1 and v2.
  2. Slight skews in distributions for v4,v9,v23,v27,v30, and v40.
  3. Marked skews in distributions for v1,v6,v8,v10,v11,v17,v18,v25,v32,v34,v37.
  4. Every variable contains outliers. Further analysis will be required to determine whether these outliers represent continuous values.
  5. The ration between "failure" and "no failure" in the target variable is 05:95. That means that the overall component failure rate in this data set is 5%.
In [ ]:
wind.nunique()
Out[ ]:
V1        19982
V2        19982
V3        20000
V4        20000
V5        20000
V6        20000
V7        20000
V8        20000
V9        20000
V10       20000
V11       20000
V12       20000
V13       20000
V14       20000
V15       20000
V16       20000
V17       20000
V18       20000
V19       20000
V20       20000
V21       20000
V22       20000
V23       20000
V24       20000
V25       20000
V26       20000
V27       20000
V28       20000
V29       20000
V30       20000
V31       20000
V32       20000
V33       20000
V34       20000
V35       20000
V36       20000
V37       20000
V38       20000
V39       20000
V40       20000
Target        2
dtype: int64

No indication of unique identifiers that would need to be dropped from the dataset. All the values in this set with the exception of the target variable are continuous numeric values.

Exploratory Data Analysis (EDA)¶

Plotting histograms and boxplots for all the variables¶

In [ ]:
sns.set_style("darkgrid")
wind.hist(figsize=(20,15))
plt.show()

Source: InnHotels Learner Notebook, Full Code

In [ ]:
num_cols=wind.select_dtypes(include=np.number).columns.tolist()
In [ ]:
plt.figure(figsize=(10,10))
plt.boxplot(wind[num_cols],whis=1.5)
plt.show()

Observation:The majority of these variables contain non-continuous outliers. If these outliers are not treated, it may be difficult to get a good generalized model. However, these outliers also represent genuine values. Therefore, removing them may not provide a true picture of conditions in the environment.

Plotting all the features at one go¶

In [ ]:
def histogram_boxplot(data, feature, figsize=(15, 10), kde=True, bins=None):
    """
    Boxplot and histogram combined

    data: dataframe
    feature: dataframe column
    figsize: size of figure (default (15,10))
    kde: whether to show the density curve (default False)
    bins: number of bins for histogram (default None)
    """
    f2, (ax_box2, ax_hist2) = plt.subplots(
        nrows=2,  # Number of rows of the subplot grid= 2
        sharex=True,  # x-axis will be shared among all subplots
        gridspec_kw={"height_ratios": (0.25, 0.75)},
        figsize=figsize,
    )  # creating the 2 subplots
    sns.boxplot(
        data=data, x=feature, ax=ax_box2, showmeans=True, color="violet"
    )  # boxplot will be created and a triangle will indicate the mean value of the column
    sns.histplot(
        data=data, x=feature, kde=kde, ax=ax_hist2, bins=bins
    ) if bins else sns.histplot(
        data=data, x=feature, kde=kde, ax=ax_hist2
    )  # For histogram
    ax_hist2.axvline(
        data[feature].mean(), color="green", linestyle="--"
    )  # Add mean to the histogram
    ax_hist2.axvline(
        data[feature].median(), color="black", linestyle="-"
    )  # Add median to the histogram
In [ ]:
for feature in wind.columns:
    histogram_boxplot(wind, feature, figsize=(12, 7), kde=True, bins=None) ## Please change the dataframe name as you define while reading the data

Source: MT Project Learner Notebook, Low Code

Observations:

  1. All of the distributions are reletively symmetrical. However, there are definate skews in variables v1,v18,v22,v26,v29,v30,v34,v35, and v37. These are variables, where the skew is causing irregularities in the distribution as well as significant differences between the mean and median. Theses variables would be the focus of any future outlier treatment. The rest would be left alone. However, removing true data points may also give an incomplete picture of real life conditions in the operating environment. Therfore, the outliers have been noted, but will not be imputed in this dataset to better simulate genuine environmental conditions.
In [ ]:
plt.figure(figsize=(35,35))
sns.heatmap(data=wind[["V1","V2","V3","V4","V5","V6","V7",
                       "V8","V9","V10","V11","V12","V13","V14","V15",
                       "V16","V17","V18","V19","V20","V21","V22","V23",
                       "V24","V25","V26","V27","V28","V29","V30","V31","V32",
                       "V33","V34","V35","V36","V37","V38","V39","V40","Target"]]
            .corr(),annot=True,cbar=False,cmap="Spectral")
Out[ ]:
<Axes: >

*Source: Video, Intro to Python with Daniel Mitchell: 3.5, Heatmap

Correlational Analysis: This analysis will focus on variables that are highly correlated, those with a correlational coefficient of 0.70 or -0.70 or grearer.

  1. No single variable is highly correlated with "Target."
  2. The following variables are highly correlated: A.v2 and v14 along with v26 (-0.85 and 0.79). B.v3 and v23 (-0.79) C.v6 and v16 (-0.75) D.V7 and v15 (0.87)
  3. v8 and v16, v23, and v29 (0.80, 0.72, 0.81)
  4. v9 and v16 (0.75).
  5. v10 and v19 (-0.70)
  6. v11 and v6 and v29 (0.71 and 0.81)
  7. v14 and v38 (-0.76)
  8. v16 and v21 (0.84)
  9. v17 and v27 (-0.71)
  10. v21 and v35 (-0.70)
  11. v19 and 34 and 39 (0.76, -0.70)
  12. v24 and v27 and v32 (-0.76, 0.83)
  13. v25 and v27, v30, v32, v33 (0.77, -0.76, -0.71,-0.74)
  14. v27 and v32 (-0.77)
  15. v36 and v39 (0.75) These variables may overlap in what they measure and could be dropped in a future model to simplify it.

Data Pre-processing¶

In [ ]:
#Always make a copy before manipulation.
wind2=wind.copy()
In [ ]:
X = wind2.drop("Target",axis=1)
y = wind2.pop("Target")
In [ ]:
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=.30, random_state=1,stratify=y)
In [ ]:
X.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20000 entries, 0 to 19999
Data columns (total 40 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   V1      19982 non-null  float64
 1   V2      19982 non-null  float64
 2   V3      20000 non-null  float64
 3   V4      20000 non-null  float64
 4   V5      20000 non-null  float64
 5   V6      20000 non-null  float64
 6   V7      20000 non-null  float64
 7   V8      20000 non-null  float64
 8   V9      20000 non-null  float64
 9   V10     20000 non-null  float64
 10  V11     20000 non-null  float64
 11  V12     20000 non-null  float64
 12  V13     20000 non-null  float64
 13  V14     20000 non-null  float64
 14  V15     20000 non-null  float64
 15  V16     20000 non-null  float64
 16  V17     20000 non-null  float64
 17  V18     20000 non-null  float64
 18  V19     20000 non-null  float64
 19  V20     20000 non-null  float64
 20  V21     20000 non-null  float64
 21  V22     20000 non-null  float64
 22  V23     20000 non-null  float64
 23  V24     20000 non-null  float64
 24  V25     20000 non-null  float64
 25  V26     20000 non-null  float64
 26  V27     20000 non-null  float64
 27  V28     20000 non-null  float64
 28  V29     20000 non-null  float64
 29  V30     20000 non-null  float64
 30  V31     20000 non-null  float64
 31  V32     20000 non-null  float64
 32  V33     20000 non-null  float64
 33  V34     20000 non-null  float64
 34  V35     20000 non-null  float64
 35  V36     20000 non-null  float64
 36  V37     20000 non-null  float64
 37  V38     20000 non-null  float64
 38  V39     20000 non-null  float64
 39  V40     20000 non-null  float64
dtypes: float64(40)
memory usage: 6.1 MB
In [ ]:
y.info()
<class 'pandas.core.series.Series'>
RangeIndex: 20000 entries, 0 to 19999
Series name: Target
Non-Null Count  Dtype
--------------  -----
20000 non-null  int64
dtypes: int64(1)
memory usage: 156.4 KB

Missing value imputation¶

In [ ]:
# Let's impute the missing values
imp_median = SimpleImputer(missing_values=np.nan, strategy="median")

# fit the imputer on train data and transform the train data
X_train["V1"] = imp_median.fit_transform(X_train[["V1"]])
X_train["V2"] = imp_median.fit_transform(X_train[["V2"]])
In [ ]:
# transform the validation data using the imputer fit on validation data
X_val["V1"] = imp_median.fit_transform(X_val[["V1"]])
X_val["V2"] = imp_median.fit_transform(X_val[["V2"]])

Source: Hyperparameter Tuning with Professor Rao: 1.5 Handson Oversampling and Undersampling

Model Building¶

Model Building with original data¶

In [ ]:
# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.recall_score)
In [ ]:
models = []  # Empty list to store all the models

# Appending models into the list
models.append(("LogR", LogisticRegression(random_state=1)))
models.append(("DTree", DecisionTreeClassifier(random_state=1)))
models.append(("AdaBoost", AdaBoostClassifier(random_state=1)))
models.append(("Bagging", BaggingClassifier(random_state=1)))
models.append(("RF",RandomForestClassifier(random_state=1)))
models.append(("GB", GradientBoostingClassifier(random_state=1)))


## Complete the code to append remaining 4 models in the list models

results1 = []  # Empty list to store all model's CV scores
names = []  # Empty list to store name of the models


# loop through all models to get the mean cross validated score
print("\n" "Cross-Validation performance on training dataset:" "\n")

for name, model in models:
    kfold = StratifiedKFold(
        n_splits=5, shuffle=True, random_state=1
    )  # Setting number of splits equal to 5
    cv_result = cross_val_score(
        estimator=model, X=X_train, y=y_train, scoring=scorer, cv=kfold
    )
    results1.append(cv_result)
    names.append(name)
    print("{}: {}".format(name, cv_result.mean()))

print("\n" "Validation Performance:" "\n")

for name, model in models:
    model.fit(X_train, y_train)
    scores = recall_score(y_val, model.predict(X_val))
    print("{}: {}".format(name, scores))
Cross-Validation performance on training dataset:

LogR: 0.4902481389578163
DTree: 0.7078246484698097
AdaBoost: 0.6434656741108354
Bagging: 0.707808105872622
RF: 0.7194127377998345
GB: 0.7220016542597187

Validation Performance:

LogR: 0.5015015015015015
DTree: 0.7057057057057057
AdaBoost: 0.6516516516516516
Bagging: 0.7267267267267268
RF: 0.7357357357357357
GB: 0.7357357357357357

Source: MT Project LearnerNotebook Low Code

In [ ]:
# Plotting boxplots for CV scores of all models defined above
fig = plt.figure(figsize=(10, 7))

fig.suptitle("Algorithm Comparison")
ax = fig.add_subplot(111)

plt.boxplot(results1)
ax.set_xticklabels(names)

plt.show()

Source: MT Project Learner Notebook Low Code

Model evaluation criterion¶

The nature of predictions made by the classification model will translate as follows:

  • True positives (TP) are failures correctly predicted by the model.
  • False negatives (FN) are real failures in a generator where there is no detection by model.
  • False positives (FP) are failure detections in a generator where there is no failure.

Which metric to optimize?

  • We need to choose the metric which will ensure that the maximum number of generator failures are predicted correctly by the model.
  • We would want Recall to be maximized as greater the Recall, the higher the chances of minimizing false negatives.
  • We want to minimize false negatives because if a model predicts that a machine will have no failure when there will be a failure, it will increase the maintenance cost.

Let's define a function to output different metrics (including recall) on the train and test set and a function to show confusion matrix so that we do not have to use the same code repetitively while evaluating models.

  • We want to reduce false negatives and will try to maximize "Recall".
  • To maximize Recall, we can use Recall as a scorer in cross-validation and hyperparameter tuning.

Observations: Recall is the metric that compares TPs to FNs, and was the metric used to determine the best model in this group. Failure to predict when a component will fail is more costly than predicting a component will fail when it doesn't fail.This company wants to replace comonents before they fail to prevent shutdowns in energy production. In order to prevent component failure before replacement, each model should maximize recall scores. Of these six models, the model with the highest recall score and the least amount of overfitting is GB or Gradiant Boosting with a training score of 72.20 and a validation score of 73.57. Recall is sensitive to class imbalance. Therefore, over and undersampling will likely increase recall in all of these models. Very little symmetry in all the distributions.

Sample Decision Tree model building with original data

Model Building with Oversampled data¶

In [ ]:
# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.recall_score)
In [ ]:
print("Before OverSampling, counts of label '1': {}".format(sum(y_train == 1)))
print("Before OverSampling, counts of label '0': {} \n".format(sum(y_train == 0)))

# Synthetic Minority Over Sampling Technique
sm = SMOTE(sampling_strategy=1, k_neighbors=5, random_state=1)
X_train_over, y_train_over = sm.fit_resample(X_train, y_train)


print("After OverSampling, counts of label '1': {}".format(sum(y_train_over == 1)))
print("After OverSampling, counts of label '0': {} \n".format(sum(y_train_over == 0)))


print("After OverSampling, the shape of train_X: {}".format(X_train_over.shape))
print("After OverSampling, the shape of train_y: {} \n".format(y_train_over.shape))
Before OverSampling, counts of label '1': 777
Before OverSampling, counts of label '0': 13223 

After OverSampling, counts of label '1': 13223
After OverSampling, counts of label '0': 13223 

After OverSampling, the shape of train_X: (26446, 40)
After OverSampling, the shape of train_y: (26446,) 

Source: MT Project Learner Notebook Low Code

Observation:As stated in the EDA, the balance between failure and not-failure was 5% to 95%. After oversampling, the balance is 50%/50%.

In [ ]:
models = []  # Empty list to store all the models

# Appending models into the list
models.append(("LogRO", LogisticRegression(random_state=1)))
models.append(("BaggingO", BaggingClassifier(random_state=1)))
models.append(("DTreeO", DecisionTreeClassifier(random_state=1)))
models.append(("AdaBoostO", AdaBoostClassifier(random_state=1)))
models.append(("RFO", RandomForestClassifier(random_state=1)))
models.append(("GBO", GradientBoostingClassifier(random_state=1)))
## Complete the code to append remaining 4 models in the list models

results1 = []  # Empty list to store all model's CV scores
names = []  # Empty list to store name of the models


# loop through all models to get the mean cross validated score
print("\n" "Cross-Validation performance on training dataset:" "\n")

for name, model in models:
    kfold = StratifiedKFold(
        n_splits=5, shuffle=True, random_state=1
    )  # Setting number of splits equal to 5
    cv_result = cross_val_score(
        estimator=model, X=X_train_over, y=y_train_over, scoring=scorer, cv=kfold
    )  ## Complete the code to build models on oversampled data
    results1.append(cv_result)
    names.append(name)
    print("{}: {}".format(name, cv_result.mean()))

print("\n" "Validation Performance:" "\n")

for name, model in models:
    model.fit(X_train_over,y_train_over)
    scores = recall_score(y_val, model.predict(X_val))
    print("{}: {}".format(name, scores))
Cross-Validation performance on training dataset:

LogRO: 0.8917044404851445
BaggingO: 0.975119441528989
DTreeO: 0.970128321355339
AdaBoostO: 0.904787470436327
RFO: 0.9829090368319754
GBO: 0.9329201902370526

Validation Performance:

LogRO: 0.8498498498498499
BaggingO: 0.8258258258258259
DTreeO: 0.7837837837837838
AdaBoostO: 0.8618618618618619
RFO: 0.8558558558558559
GBO: 0.8768768768768769
In [ ]:
# Plotting boxplots for CV scores of all models defined above
fig = plt.figure(figsize=(10, 7))

fig.suptitle("Algorithm Comparison")
ax = fig.add_subplot(111)

plt.boxplot(results1)
ax.set_xticklabels(names)

plt.show()

Observations: The best fit model after oversampling and the most symmetrical distribution was LogRO with training/validation recall comparison scores of 89.17/84.98. AdaBoostO is a slightly better fit model with a training/validation recall score comparison of 90.47/86.18. However, the distribution is not as symmetrical. DTreeO has the most symmetical distribution. However, with training/validatioon recall scores of 97.01/78.37, it is one of the most overfit models.

Model Building with Undersampled data¶

In [ ]:
rus = RandomUnderSampler(random_state=1, sampling_strategy=1)
X_train_un, y_train_un = rus.fit_resample(X_train, y_train)


print("Before UnderSampling, counts of label '1': {}".format(sum(y_train == 1)))
print("Before UnderSampling, counts of label '0': {} \n".format(sum(y_train == 0)))


print("After UnderSampling, counts of label '1': {}".format(sum(y_train_un == 1)))
print("After UnderSampling, counts of label '0': {} \n".format(sum(y_train_un == 0)))


print("After UnderSampling, the shape of train_X: {}".format(X_train_un.shape))
print("After UnderSampling, the shape of train_y: {} \n".format(y_train_un.shape))
Before UnderSampling, counts of label '1': 777
Before UnderSampling, counts of label '0': 13223 

After UnderSampling, counts of label '1': 777
After UnderSampling, counts of label '0': 777 

After UnderSampling, the shape of train_X: (1554, 40)
After UnderSampling, the shape of train_y: (1554,) 

MT Project Learner Notebook Low Code

Observation: Just like the last round, the class imbalance before undersampling is 5.5%/94.5%.

In [ ]:
models = []  # Empty list to store all the models

# Appending models into the list
models.append(("LogRU", LogisticRegression(random_state=1)))
models.append(("BaggingU", BaggingClassifier(random_state=1)))
models.append(("DTreeU", DecisionTreeClassifier(random_state=1)))
models.append(("AdaBoostU", AdaBoostClassifier(random_state=1)))
models.append(("RFU", RandomForestClassifier(random_state=1)))
models.append(("GBU", GradientBoostingClassifier(random_state=1)))
## Complete the code to append remaining 4 models in the list models

results1 = []  # Empty list to store all model's CV scores
names = []  # Empty list to store name of the models


# loop through all models to get the mean cross validated score
print("\n" "Cross-Validation performance on training dataset:" "\n")

for name, model in models:
    kfold = StratifiedKFold(
        n_splits=5, shuffle=True, random_state=1
    )  # Setting number of splits equal to 5
    cv_result = cross_val_score(
        estimator=model, X=X_train_un, y=y_train_un, scoring=scorer, cv=kfold
    )  ## Complete the code to build models on oversampled data
    results1.append(cv_result)
    names.append(name)
    print("{}: {}".format(name, cv_result.mean()))

print("\n" "Validation Performance:" "\n")

for name, model in models:
    model.fit(X_train_un,y_train_un)## Complete the code to build models on oversampled data
    scores = recall_score(y_val, model.predict(X_val))
    print("{}: {}".format(name, scores))
Cross-Validation performance on training dataset:

LogRU: 0.8726220016542598
BaggingU: 0.880339123242349
DTreeU: 0.8622167080231596
AdaBoostU: 0.8725971877584782
RFU: 0.9034822167080232
GBU: 0.8932009925558313

Validation Performance:

LogRU: 0.8468468468468469
BaggingU: 0.8708708708708709
DTreeU: 0.8408408408408409
AdaBoostU: 0.8588588588588588
RFU: 0.8828828828828829
GBU: 0.8828828828828829
In [ ]:
# Plotting boxplots for CV scores of all models defined above
fig = plt.figure(figsize=(10, 7))

fig.suptitle("Algorithm Comparison")
ax = fig.add_subplot(111)

plt.boxplot(results1)
ax.set_xticklabels(names)

plt.show()

Observations: BaggingU is the closest fit model so far with a training/validation recall score of 88.03/87.08. Unfortunately, the presence of outliers skews the distribution. Just like the base models, none of the undersampled models

*Comparing the modles so far

Model Comparison so far

In [ ]:
comparison_frame1 = pd.DataFrame({'Base Model':['LogR','DTree','AdaBoost','Bagging',
                                          'RF','GB'],
                                  'Train_Recall':[0.49,0.70,0.64,0.70,0.71,0.72],
                                  'Val_Recall':[0.50,0.70,0.65,0.72,0.73,0.73]})
comparison_frame2=pd.DataFrame({'Oversample':['LogRO','BaggingO','DTreeO','AdaBoostO','RFO','GBO'],
                                 'Train Recall':[0.89,0.97,0.97,0.90,0.98,0.93],
                                 'Val_Recall':[0.84,0.82,0.78,0.86,0.85,0.87],})
comparison_frame3=pd.DataFrame({'Undersample':['LorRU','DTreeU','AdaBoostU','BaggingU','RFU','GBU'],
                                 'Train_Recall':[0.87,0.88,0.86,0.87,0.90,0.89],
                                 'Test_Recall':[0.84,0.87,0.84,0.85,0.88,0.88]})

Source: EasyVisa Learner Notebook Full Code

In [ ]:
comparison_frame1
Out[ ]:
Base Model Train_Recall Val_Recall
0 LogR 0.490 0.500
1 DTree 0.700 0.700
2 AdaBoost 0.640 0.650
3 Bagging 0.700 0.720
4 RF 0.710 0.730
5 GB 0.720 0.730
In [ ]:
comparison_frame2
Out[ ]:
Oversample Train Recall Val_Recall
0 LogRO 0.890 0.840
1 BaggingO 0.970 0.820
2 DTreeO 0.970 0.780
3 AdaBoostO 0.900 0.860
4 RFO 0.980 0.850
5 GBO 0.930 0.870
In [ ]:
comparison_frame3
Out[ ]:
Undersample Train_Recall Test_Recall
0 LorRU 0.870 0.840
1 DTreeU 0.880 0.870
2 AdaBoostU 0.860 0.840
3 BaggingU 0.870 0.850
4 RFU 0.900 0.880
5 GBU 0.890 0.880

Observations:The four models with the best recall train/val scores are DTreeU (0.88/0.87), AdaBoostU (0.86/0.84), BaggingU (0.87/0.85) and GBU (0.89/0.88).

HyperparameterTuning¶

LogRU

In [ ]:
# defining model
LogRO_tuned = LogisticRegression(random_state=1)

# Parameter grid to pass in RandomSearchCV
param_grid = {'C':np.arange(0.1,1.1,0.1)}

#Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(estimator=LogRO_tuned, param_distributions=param_grid,
                                   n_iter=10, n_jobs = -1, verbose=2,
                                   scoring=scorer, cv=5, random_state=1)

#Fitting parameters in RandomizedSearchCV
randomized_cv.fit(X_train_over,y_train_over)

print("Best parameters are {} with CV score={}:" .format(randomized_cv.best_params_,randomized_cv.best_score_))
Fitting 5 folds for each of 10 candidates, totalling 50 fits
Best parameters are {'C': 0.2} with CV score=0.8920823693264202:

Sources: Easy Visa Learner Notebook Full Code and MT Learner Notebook Full Code

In [ ]:
LogRO_tuned.get_params()
Out[ ]:
{'C': 1.0,
 'class_weight': None,
 'dual': False,
 'fit_intercept': True,
 'intercept_scaling': 1,
 'l1_ratio': None,
 'max_iter': 100,
 'multi_class': 'auto',
 'n_jobs': None,
 'penalty': 'l2',
 'random_state': 1,
 'solver': 'lbfgs',
 'tol': 0.0001,
 'verbose': 0,
 'warm_start': False}

Source: Practice Notebook Hyperparameter Tuning

In [ ]:
# Set the clf to the best combination of parameters
LogRO_best = LogisticRegression(
    C=0.2,
    class_weight="balanced",
    dual=False,
    fit_intercept=True,
    l1_ratio=1,
    max_iter=100,
    multi_class="auto",
    n_jobs=None,
    random_state=1,
    solver='lbfgs',
    tol=0.0001,
    verbose=0,
    warm_start=False)

# Fit the best algorithm to the data.
LogRO_best.fit(X_train_over, y_train_over)
Out[ ]:
LogisticRegression(C=0.2, class_weight='balanced', l1_ratio=1, random_state=1)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
LogisticRegression(C=0.2, class_weight='balanced', l1_ratio=1, random_state=1)
In [ ]:
print("Accuracy on train and validation set")
print(accuracy_score(y_train_over, LogRO_best.predict(X_train_over)))
print(accuracy_score(y_val, LogRO_best.predict(X_val)))
print("Recall on train and validation set")
print(recall_score(y_train_over, LogRO_best.predict(X_train_over)))
print(recall_score(y_val, LogRO_best.predict(X_val)))
print("Precision on train and validation set")
print(precision_score(y_train_over, LogRO_best.predict(X_train_over)))
print(precision_score(y_val, LogRO_best.predict(X_val)))
print("F1 on train and validation set")
print(f1_score(y_train_over, LogRO_best.predict(X_train_over)))
print(f1_score(y_val, LogRO_best.predict(X_val)))
print("")
Accuracy on train and validation set
0.8856159721697043
0.8678333333333333
Recall on train and validation set
0.8919307267639719
0.8498498498498499
Precision on train and validation set
0.8808065720687079
0.27582846003898637
F1 on train and validation set
0.8863337466651636
0.4164827078734364

In [ ]:
# defining model
DTreeU_tuned = DecisionTreeClassifier(random_state=1)

# Parameter grid to pass in RandomSearchCV
param_grid = {'max_depth': np.arange(2,6),
              'min_samples_leaf': [1, 4, 7],
              'max_leaf_nodes' : [10,15],
              'min_impurity_decrease': [0.0001,0.001] }

#Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(estimator=DTreeU_tuned, param_distributions=param_grid, n_iter=10, n_jobs = -1, scoring=scorer, cv=5, random_state=1)

#Fitting parameters in RandomizedSearchCV
randomized_cv.fit(X_train_un,y_train_un)

print("Best parameters are {} with CV score={}:" .format(randomized_cv.best_params_,randomized_cv.best_score_))
Best parameters are {'min_samples_leaf': 1, 'min_impurity_decrease': 0.0001, 'max_leaf_nodes': 10, 'max_depth': 5} with CV score=0.8506368899917287:

Sources: Easy Visa Project Learner Notebook Full Code and MT Project Learner Notebook Full Code

In [ ]:
DTreeU_tuned.get_params()
Out[ ]:
{'ccp_alpha': 0.0,
 'class_weight': None,
 'criterion': 'gini',
 'max_depth': None,
 'max_features': None,
 'max_leaf_nodes': None,
 'min_impurity_decrease': 0.0,
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'min_weight_fraction_leaf': 0.0,
 'random_state': 1,
 'splitter': 'best'}
In [ ]:
DTreeU_best = DTreeU_tuned

# Fit the best algorithm to the data.
DTreeU_best.fit(X_train_un, y_train_un)
Out[ ]:
DecisionTreeClassifier(random_state=1)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
DecisionTreeClassifier(random_state=1)

Note:I had to fit this a second time in order to get a recall score because when I ran "DTreeU_tuned" through, Python said it hadn't been fitted yet.

In [ ]:
print("Accuracy on train and validation set")
print(accuracy_score(y_train_over, DTreeU_best.predict(X_train_over)))
print(accuracy_score(y_val, DTreeU_best.predict(X_val)))
print("Recall on train and validation set")
print(recall_score(y_train_over, DTreeU_best.predict(X_train_over)))
print(recall_score(y_val, DTreeU_best.predict(X_val)))
print("Precision on train and validation set")
print(precision_score(y_train_over, DTreeU_best.predict(X_train_over)))
print(precision_score(y_val, DTreeU_best.predict(X_val)))
print("F1 on train and validation set")
print(f1_score(y_train_over, DTreeU_best.predict(X_train_over)))
print(f1_score(y_val, DTreeU_best.predict(X_val)))
print("")
Accuracy on train and validation set
0.890607275202299
0.8315
Recall on train and validation set
0.9316342736141572
0.8408408408408409
Precision on train and validation set
0.8609868604976237
0.22617124394184168
F1 on train and validation set
0.8949184555591879
0.35646085295989816

AdaBoostU

In [ ]:
# defining model
AdaBoostU_tuned = DecisionTreeClassifier(random_state=1)

# Parameter grid to pass in RandomSearchCV
param_grid = {'n_estimators':np.arange,
              'min_samples_leaf': [1, 4, 7],
              'max_leaf_nodes' : [10,15],
              'min_impurity_decrease': [0.0001,0.001] }

#Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(estimator=DTreeU_tuned, param_distributions=param_grid, n_iter=10, n_jobs = -1, scoring=scorer, cv=5, random_state=1)

#Fitting parameters in RandomizedSearchCV
randomized_cv.fit(X_train_un,y_train_un)

print("Best parameters are {} with CV score={}:" .format(randomized_cv.best_params_,randomized_cv.best_score_))

Source: Easy Visa Project Learner Notebook Full Code

In [ ]:
AdaBoostClassifier().get_params()
Out[ ]:
{'algorithm': 'SAMME.R',
 'base_estimator': 'deprecated',
 'estimator': None,
 'learning_rate': 1.0,
 'n_estimators': 50,
 'random_state': None}
In [ ]:
# defining model
AdaBoostU_tuned = AdaBoostClassifier(random_state=1)

# Parameter grid to pass in RandomSearchCV
param_grid={'learning_rate': [0.001,0.01,0.1,1.0],
            'n_estimators': [50,100,150,200],
             }


#Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(estimator=AdaBoostU_tuned, param_distributions=param_grid,
                                   n_iter=10, n_jobs = -1, verbose=2,
                                   scoring=scorer, cv=5, random_state=1)

#Fitting parameters in RandomizedSearchCV
randomized_cv.fit(X_train_un,y_train_un)

print("Best parameters are {} with CV score={}:" .format(randomized_cv.best_params_,randomized_cv.best_score_))
Fitting 5 folds for each of 10 candidates, totalling 50 fits
Best parameters are {'n_estimators': 100, 'learning_rate': 1.0} with CV score=0.8854259718775849:

Sources: Easy Visa Project Learner Notebook Full Code

In [ ]:
# Creating new pipeline with best parameters
AdaBoostU_tuned_best = AdaBoostClassifier(n_estimators=100, learning_rate=1.0, base_estimator= DecisionTreeClassifier(min_samples_leaf=1,
                            min_impurity_decrease=0.001, max_leaf_nodes=10,max_depth=5,random_state=1))
## Complete the code with the best parameters obtained from tuning

AdaBoostU_tuned_best.fit(X_train_un,y_train_un)
Out[ ]:
AdaBoostClassifier(base_estimator=DecisionTreeClassifier(max_depth=5,
                                                         max_leaf_nodes=10,
                                                         min_impurity_decrease=0.001,
                                                         random_state=1),
                   n_estimators=100)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
AdaBoostClassifier(base_estimator=DecisionTreeClassifier(max_depth=5,
                                                         max_leaf_nodes=10,
                                                         min_impurity_decrease=0.001,
                                                         random_state=1),
                   n_estimators=100)
DecisionTreeClassifier(max_depth=5, max_leaf_nodes=10,
                       min_impurity_decrease=0.001, random_state=1)
DecisionTreeClassifier(max_depth=5, max_leaf_nodes=10,
                       min_impurity_decrease=0.001, random_state=1)
In [ ]:
AdaBoost_best=AdaBoostU_tuned_best
AdaBoost_best.fit(X_train_un, y_train_un)
Out[ ]:
AdaBoostClassifier(base_estimator=DecisionTreeClassifier(max_depth=5,
                                                         max_leaf_nodes=10,
                                                         min_impurity_decrease=0.001,
                                                         random_state=1),
                   n_estimators=100)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
AdaBoostClassifier(base_estimator=DecisionTreeClassifier(max_depth=5,
                                                         max_leaf_nodes=10,
                                                         min_impurity_decrease=0.001,
                                                         random_state=1),
                   n_estimators=100)
DecisionTreeClassifier(max_depth=5, max_leaf_nodes=10,
                       min_impurity_decrease=0.001, random_state=1)
DecisionTreeClassifier(max_depth=5, max_leaf_nodes=10,
                       min_impurity_decrease=0.001, random_state=1)

NOTE:I had to fit this twice to get a recall score because the first time, I got an error that said the model had not been fitted.

In [ ]:
print("Accuracy on train and validation set")
print(accuracy_score(y_train_over, AdaBoost_best.predict(X_train_over)))
print(accuracy_score(y_val, AdaBoost_best.predict(X_val)))
print("Recall on train and validation set")
print(recall_score(y_train_over, AdaBoost_best.predict(X_train_over)))
print(recall_score(y_val, AdaBoost_best.predict(X_val)))
print("Precision on train and validation set")
print(precision_score(y_train_over, AdaBoost_best.predict(X_train_over)))
print(precision_score(y_val, AdaBoost_best.predict(X_val)))
print("F1 on train and validation set")
print(f1_score(y_train_over, AdaBoost_best.predict(X_train_over)))
print(f1_score(y_val, AdaBoost_best.predict(X_val)))
print("")
Accuracy on train and validation set
0.9537548211449747
0.9345
Recall on train and validation set
0.9708084398396732
0.8738738738738738
Precision on train and validation set
0.9387889425186485
0.4532710280373832
F1 on train and validation set
0.9545302450087371
0.596923076923077

GBU (Gradiant Boost with undersampling)

In [ ]:
# defining model
GBU_tuned = GradientBoostingClassifier(random_state=1)

# Parameter grid to pass in RandomSearchCV
param_grid={"n_estimators":np.arange(100,150,25),
            "learning_rate":[0.2,0.05,1.0],
             }


#Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(estimator=GBU_tuned, param_distributions=param_grid,
                                   n_iter=10, n_jobs = -1, verbose=2,
                                   scoring=scorer, cv=5, random_state=1)

#Fitting parameters in RandomizedSearchCV
randomized_cv.fit(X_train_un,y_train_un)

print("Best parameters are {} with CV score={}:" .format(randomized_cv.best_params_,randomized_cv.best_score_))
Fitting 5 folds for each of 6 candidates, totalling 30 fits
Best parameters are {'n_estimators': 100, 'learning_rate': 0.2} with CV score=0.902191894127378:

Source: Easy Visa Project Learner Notebook Full Code and MT Project Learner Notebook Full Code

In [ ]:
GBU_best = GradientBoostingClassifier(
    n_estimators=100,
    learning_rate=0.2,
    random_state=1)

# Fit the best algorithm to the data.
GBU_best.fit(X_train_over, y_train_over)
Out[ ]:
GradientBoostingClassifier(learning_rate=0.2, random_state=1)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
GradientBoostingClassifier(learning_rate=0.2, random_state=1)
In [ ]:
print("Accuracy on train and validation set")
print(accuracy_score(y_train_over, GBU_best.predict(X_train_over)))
print(accuracy_score(y_val, GBU_best.predict(X_val)))
print("Recall on train and validation set")
print(recall_score(y_train_over, GBU_best.predict(X_train_over)))
print(recall_score(y_val, GBU_best.predict(X_val)))
print("Precision on train and validation set")
print(precision_score(y_train_over, GBU_best.predict(X_train_over)))
print(precision_score(y_val, GBU_best.predict(X_val)))
print("F1 on train and validation set")
print(f1_score(y_train_over, GBU_best.predict(X_train_over)))
print(f1_score(y_val, GBU_best.predict(X_val)))
print("")
Accuracy on train and validation set
0.9709218785449596
0.9708333333333333
Recall on train and validation set
0.9553807759207441
0.8678678678678678
Precision on train and validation set
0.9860287230721199
0.6880952380952381
F1 on train and validation set
0.9704628384866525
0.7675962815405046

Sample Parameter Grids¶

Hyperparameter tuning can take a long time to run, so to avoid that time complexity - you can use the following grids, wherever required.

  • For Gradient Boosting:

param_grid = { "n_estimators": np.arange(100,150,25), "learning_rate": [0.2, 0.05, 1], "subsample":[0.5,0.7], "max_features":[0.5,0.7] }

  • For Adaboost:

param_grid = { "n_estimators": [100, 150, 200], "learning_rate": [0.2, 0.05], "base_estimator": [DecisionTreeClassifier(max_depth=1, random_state=1), DecisionTreeClassifier(max_depth=2, random_state=1), DecisionTreeClassifier(max_depth=3, random_state=1), ] }

  • For Bagging Classifier:

param_grid = { 'max_samples': [0.8,0.9,1], 'max_features': [0.7,0.8,0.9], 'n_estimators' : [30,50,70], }

  • For Random Forest:

param_grid = { "n_estimators": [200,250,300], "min_samples_leaf": np.arange(1, 4), "max_features": [np.arange(0.3, 0.6, 0.1),'sqrt'], "max_samples": np.arange(0.4, 0.7, 0.1) }

  • For Decision Trees:

param_grid = { 'max_depth': np.arange(2,6), 'min_samples_leaf': [1, 4, 7], 'max_leaf_nodes' : [10, 15], 'min_impurity_decrease': [0.0001,0.001] }

  • For Logistic Regression:

param_grid = {'C': np.arange(0.1,1.1,0.1)}

  • For XGBoost:

param_grid={ 'n_estimators': [150, 200, 250], 'scale_pos_weight': [5,10], 'learning_rate': [0.1,0.2], 'gamma': [0,3,5], 'subsample': [0.8,0.9] }

Model performance comparison and choosing the final model¶

In [ ]:
comparison_frame = pd.DataFrame({'Best Models':['LogRO_best','DTreeU_best','AdaBoost_best','GBU_best'],
                                  'Train_Recall':[0.89,0.93,0.97,0.95],
                                  'Val_Recall':[0.84,.084,0.84,0.86]})
comparison_frame
Out[ ]:
Best Models Train_Recall Val_Recall
0 LogRO_best 0.890 0.840
1 DTreeU_best 0.930 0.084
2 AdaBoost_best 0.970 0.840
3 GBU_best 0.950 0.860

The model with the highest and best fit recall score is LogRO_best or Logistic Regression with oversampling with a recall training/validation score of 0.89/0.84. This model contains obvious overfitting. The introduction to this project suggested that some variables represnted weather data factors, which influence one another. The correlational analysis also found several varibles with high correlational coefficients. This suggests a degree of colinearity. This may be pulling down all of the scores.

Check for colinarity.

In [ ]:
import statsmodels.stats.api as sms
from statsmodels.stats.outliers_influence import variance_inflation_factor
import statsmodels.api as sm
from statsmodels.tools.tools import add_constant
In [ ]:
# let's check the VIF of the predictors
vif_series = pd.Series(
    [variance_inflation_factor(X_train_over.values, i) for i in range(X_train_over.shape[1])],
    index=X_train_over.columns,
    dtype=float,
)
print("VIF values: \n\n{}\n".format(vif_series))
VIF values: 

V1              1102.325
V2              1671.602
V3    12885835843692.406
V4    22239998159854.301
V5     5792411096296.458
V6    12475345228173.119
V7    16773182969722.518
V8     4187447352273.822
V9    31493703687905.566
V10   14274483763456.406
V11   19453994070714.887
V12   21756519938987.902
V13    9748051141494.580
V14   16376725917710.895
V15   36173490982895.547
V16    6757088713234.053
V17    6424535845036.371
V18   14622076712241.871
V19   15189206163138.266
V20    8481355230452.911
V21   10891413850956.459
V22    8578285004515.230
V23   76984609014880.281
V24   16199998659606.102
V25    5345518845543.615
V26   21548323575935.387
V27   18307315558416.652
V28   15037060525444.061
V29   17455812509187.969
V30    9843933611738.789
V31    7718251289409.591
V32   12025633183899.855
V33   10341216136327.201
V34   14297141674192.051
V35   23765697242060.664
V36   18196362130789.883
V37   16742006049704.445
V38   42891425022576.156
V39   26260056136271.113
V40   15502924706955.236
dtype: float64

Every single variable is high, very high, way over 10, which indicates a high degree of colinearity. Beinning with V23, which has the highest VIF, I can try to reduce colinearity and simplify the model in the hopes of improving the scores. This must be done one variable at a time.

Source: InnHotels Project Learner Notebook Full Code

In [ ]:
X_train1 = X_train_over.drop("V23", axis=1)
In [ ]:
vif_series = pd.Series(
    [variance_inflation_factor(X_train1.values, i) for i in range(X_train1.shape[1])],
    index=X_train1.columns,
    dtype=float,
)
print("VIF values: \n\n{}\n".format(vif_series))
VIF values: 

V1               1102.324
V2               1671.383
V3     26727594227718.078
V4     11682489305760.041
V5     19971616972818.164
V6     15746851843952.783
V7     13964650007350.375
V8     21193410011155.273
V9     30124412223214.020
V10    17388415549693.035
V11    23395322739586.992
V12    27886065804151.680
V13    36028797018963.969
V14    20332278227406.301
V15    14365549050623.592
V16     5317118804451.589
V17   450359962737049.625
V18     8652448851816.515
V19    14790146559509.018
V20    48687563539140.500
V21    10388926476056.508
V22    10735636775615.008
V24    11820471462914.688
V25    10293942005418.277
V26    12978673277724.771
V27     3885763267791.627
V28    15087435937589.602
V29    19123565296690.004
V30    15556475396789.277
V31    23334713095183.918
V32    11161337366469.631
V33    28685347945035.008
V34    13168419963071.625
V35     9919822967776.424
V36    19164253733491.473
V37     9612806034942.361
V38    37374270766560.133
V39    16930825666806.375
V40     8619329430374.155
dtype: float64

Source: InnHotels Project Learner Notebook, Full Code

Dropping V23 significantly increased V17. I will try dropping V17 to see detemine if I can further reduce colinearity. If values continue to increase, this may not be an option.

In [ ]:
X_train2 = X_train1.drop("V17", axis=1)

vif_series = pd.Series(
    [variance_inflation_factor(X_train2.values, i) for i in range(X_train2.shape[1])],
    index=X_train2.columns,
    dtype=float,
)
print("VIF values: \n\n{}\n".format(vif_series))
VIF values: 

V1              1102.320
V2              1671.371
V3     9726997035357.443
V4    22860911814063.430
V5     6194772527332.182
V6    17695872799098.215
V7    15583389714084.762
V8    14388497212046.312
V9    10917817278473.930
V10   36764078590779.562
V11   10609186401343.924
V12   14504346625991.936
V13   15885712971324.500
V14    9571943947652.488
V15   27629445566690.160
V16    8378790004410.226
V18   11945887605757.283
V19   19496102282989.160
V20    5285915055599.174
V21    8085457140701.070
V22   21862134113449.008
V24   37219831631161.125
V25   39505259889214.875
V26   21094143453725.977
V27   13207037030412.012
V28    8355472406995.354
V29   25882756479140.781
V30   22350370359158.789
V31   31493703687905.566
V32   39854863959030.938
V33   30741294384781.543
V34   10761289432187.564
V35    8136584692629.622
V36   15087435937589.602
V37   29825163095168.848
V38   26569909306020.625
V39   18123137333482.883
V40    7376903566536.439
dtype: float64

Eliminating the higher number is just increasing VIF values. A second option is to look at the summary data for logistic regression and eliminate variables with the highest p values and watch thier impace pseudo R squared.

In [ ]:
# fitting the model on training set
logit = sm.Logit(y_train_over, X_train_over.astype(float))
lg = logit.fit()
Warning: Maximum number of iterations has been exceeded.
         Current function value: 0.333100
         Iterations: 35
/usr/local/lib/python3.10/dist-packages/statsmodels/base/model.py:604: ConvergenceWarning: Maximum Likelihood optimization failed to converge. Check mle_retvals
  warnings.warn("Maximum Likelihood optimization failed to "
In [ ]:
print(lg.summary())
                           Logit Regression Results                           
==============================================================================
Dep. Variable:                 Target   No. Observations:                26446
Model:                          Logit   Df Residuals:                    26406
Method:                           MLE   Df Model:                           39
Date:                Wed, 19 Jul 2023   Pseudo R-squ.:                  0.5194
Time:                        16:24:49   Log-Likelihood:                -8809.2
converged:                      False   LL-Null:                       -18331.
Covariance Type:            nonrobust   LLR p-value:                     0.000
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
V1             0.2400      0.191      1.254      0.210      -0.135       0.615
V2             0.2652      0.450      0.590      0.555      -0.616       1.147
V3             0.6369   1.71e+05   3.72e-06      1.000   -3.36e+05    3.36e+05
V4             1.1091        nan        nan        nan         nan         nan
V5            -0.2568        nan        nan        nan         nan         nan
V6             0.0996   5.11e+05   1.95e-07      1.000      -1e+06       1e+06
V7            -0.1407        nan        nan        nan         nan         nan
V8             0.4508   1.62e+05   2.78e-06      1.000   -3.18e+05    3.18e+05
V9             0.1921   1.88e+05   1.02e-06      1.000   -3.69e+05    3.69e+05
V10            0.3600        nan        nan        nan         nan         nan
V11            0.7280        nan        nan        nan         nan         nan
V12           -0.8945   1.18e+05  -7.57e-06      1.000   -2.32e+05    2.32e+05
V13            0.2017   2.37e+05   8.53e-07      1.000   -4.64e+05    4.64e+05
V14            0.4546        nan        nan        nan         nan         nan
V15           -0.5897        nan        nan        nan         nan         nan
V16            0.6805        nan        nan        nan         nan         nan
V17            0.0037        nan        nan        nan         nan         nan
V18            0.5933   1.63e+04   3.64e-05      1.000    -3.2e+04     3.2e+04
V19            0.8416        nan        nan        nan         nan         nan
V20           -0.4028        nan        nan        nan         nan         nan
V21            0.3022   1.02e+05   2.97e-06      1.000   -1.99e+05    1.99e+05
V22            0.1903        nan        nan        nan         nan         nan
V23            0.7050   9.97e+04   7.07e-06      1.000   -1.95e+05    1.95e+05
V24           -0.3336   1.62e+05  -2.06e-06      1.000   -3.17e+05    3.17e+05
V25            0.8838        nan        nan        nan         nan         nan
V26           -0.4792        nan        nan        nan         nan         nan
V27           -0.3292        nan        nan        nan         nan         nan
V28           -0.6515        nan        nan        nan         nan         nan
V29            0.0135        nan        nan        nan         nan         nan
V30            0.1775   5.08e+04   3.49e-06      1.000   -9.96e+04    9.96e+04
V31            0.1465   8.55e+04   1.71e-06      1.000   -1.67e+05    1.67e+05
V32           -0.0396        nan        nan        nan         nan         nan
V33           -0.5451   1.17e+05  -4.66e-06      1.000   -2.29e+05    2.29e+05
V34           -0.1400        nan        nan        nan         nan         nan
V35            0.0533   3.66e+04   1.46e-06      1.000   -7.16e+04    7.16e+04
V36            0.2298        nan        nan        nan         nan         nan
V37           -0.0106        nan        nan        nan         nan         nan
V38            0.8679        nan        nan        nan         nan         nan
V39           -0.0616   3.62e+05   -1.7e-07      1.000    -7.1e+05     7.1e+05
V40            0.4523   1.69e+05   2.67e-06      1.000   -3.32e+05    3.32e+05
==============================================================================

Several of the p values exceed 0.05. However, several are also nan, probably because they are negative values. I can start with V3 the first p value of 1 and see how eliminting it changes the model.

Source: InnHotels Project Learner Notebook Full Code

In [ ]:
X_train2=X_train_over.drop(["V3"],axis=1)

LogR2=sm.Logit(y_train_over,X_train2.astype(float))
lg2=LogR2.fit()
Warning: Maximum number of iterations has been exceeded.
         Current function value: 0.333100
         Iterations: 35
/usr/local/lib/python3.10/dist-packages/statsmodels/base/model.py:604: ConvergenceWarning: Maximum Likelihood optimization failed to converge. Check mle_retvals
  warnings.warn("Maximum Likelihood optimization failed to "
In [ ]:
print(lg2.summary())
                           Logit Regression Results                           
==============================================================================
Dep. Variable:                 Target   No. Observations:                26446
Model:                          Logit   Df Residuals:                    26407
Method:                           MLE   Df Model:                           38
Date:                Wed, 19 Jul 2023   Pseudo R-squ.:                  0.5194
Time:                        16:41:49   Log-Likelihood:                -8809.2
converged:                      False   LL-Null:                       -18331.
Covariance Type:            nonrobust   LLR p-value:                     0.000
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
V1             0.2400      0.191      1.254      0.210      -0.135       0.615
V2             0.2652      0.450      0.590      0.555      -0.616       1.147
V4             1.1246        nan        nan        nan         nan         nan
V5            -0.2648        nan        nan        nan         nan         nan
V6             0.0382   6.73e+05   5.68e-08      1.000   -1.32e+06    1.32e+06
V7            -0.1680        nan        nan        nan         nan         nan
V8             0.4570   5.27e+05   8.68e-07      1.000   -1.03e+06    1.03e+06
V9             0.1667   2.15e+06   7.76e-08      1.000   -4.21e+06    4.21e+06
V10            0.3564   2.32e+06   1.54e-07      1.000   -4.54e+06    4.54e+06
V11            0.6486   1.78e+05   3.65e-06      1.000   -3.48e+05    3.48e+05
V12           -0.8526        nan        nan        nan         nan         nan
V13            0.2020   7.23e+05   2.79e-07      1.000   -1.42e+06    1.42e+06
V14            0.3874   1.39e+06   2.78e-07      1.000   -2.73e+06    2.73e+06
V15           -0.5833        nan        nan        nan         nan         nan
V16            0.6358   7.41e+05   8.59e-07      1.000   -1.45e+06    1.45e+06
V17            0.0171   8.06e+05   2.12e-08      1.000   -1.58e+06    1.58e+06
V18            0.5636   1.41e+06      4e-07      1.000   -2.76e+06    2.76e+06
V19            0.9157   2.84e+05   3.23e-06      1.000   -5.56e+05    5.56e+05
V20           -0.3792   8.93e+05  -4.25e-07      1.000   -1.75e+06    1.75e+06
V21            0.2055   1.85e+05   1.11e-06      1.000   -3.62e+05    3.62e+05
V22            0.1185   3.49e+06    3.4e-08      1.000   -6.84e+06    6.84e+06
V23            0.5200   8.02e+05   6.48e-07      1.000   -1.57e+06    1.57e+06
V24           -0.3434        nan        nan        nan         nan         nan
V25            0.9289        nan        nan        nan         nan         nan
V26           -0.3603   9.88e+05  -3.65e-07      1.000   -1.94e+06    1.94e+06
V27           -0.3540   2.95e+05   -1.2e-06      1.000   -5.78e+05    5.78e+05
V28           -0.7054   5.38e+05  -1.31e-06      1.000   -1.05e+06    1.05e+06
V29            0.0416        nan        nan        nan         nan         nan
V30            0.1516      4e+05   3.79e-07      1.000   -7.84e+05    7.84e+05
V31            0.2821        nan        nan        nan         nan         nan
V32           -0.1214   9.78e+04  -1.24e-06      1.000   -1.92e+05    1.92e+05
V33           -0.5342   2.68e+05     -2e-06      1.000   -5.25e+05    5.25e+05
V34           -0.1276   3.39e+05  -3.76e-07      1.000   -6.65e+05    6.65e+05
V35            0.1592   1.22e+06    1.3e-07      1.000    -2.4e+06     2.4e+06
V36            0.3297   1.74e+06   1.89e-07      1.000   -3.42e+06    3.42e+06
V37           -0.0310   6.29e+05  -4.92e-08      1.000   -1.23e+06    1.23e+06
V38            0.8094   3.39e+05   2.39e-06      1.000   -6.64e+05    6.64e+05
V39           -0.0271   6.24e+05  -4.34e-08      1.000   -1.22e+06    1.22e+06
V40            0.4719   2.52e+05   1.87e-06      1.000   -4.95e+05    4.95e+05
==============================================================================

I am going to drop all the p values with a score of 1.0 all at once just to see what happens. If a problem arises, I can always go back and drop them one at a time.

In [ ]:
X_train3=X_train_over.drop(["V6","V8","V9","V10",
                            "V11","V13","V14","V16","V17","V18",
                            "V19","V20","V21","V22","V23","V26",
                            "V27","V28","V32","V33",
                            "V4","V35","V36","V37","V38","V39","V40"],axis=1)

LogR3=sm.Logit(y_train_over,X_train3.astype(float))
lg3=LogR3.fit()
Warning: Maximum number of iterations has been exceeded.
         Current function value: 0.333100
         Iterations: 35
/usr/local/lib/python3.10/dist-packages/statsmodels/base/model.py:604: ConvergenceWarning: Maximum Likelihood optimization failed to converge. Check mle_retvals
  warnings.warn("Maximum Likelihood optimization failed to "
In [ ]:
print(lg3.summary())
                           Logit Regression Results                           
==============================================================================
Dep. Variable:                 Target   No. Observations:                26446
Model:                          Logit   Df Residuals:                    26433
Method:                           MLE   Df Model:                           12
Date:                Wed, 19 Jul 2023   Pseudo R-squ.:                  0.5194
Time:                        16:52:34   Log-Likelihood:                -8809.2
converged:                      False   LL-Null:                       -18331.
Covariance Type:            nonrobust   LLR p-value:                     0.000
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
V1             0.2400      0.191      1.254      0.210      -0.135       0.615
V2             0.2652      0.450      0.590      0.555      -0.616       1.147
V3             0.2764   4.95e+04   5.58e-06      1.000   -9.71e+04    9.71e+04
V5             1.4130   8.05e+04   1.76e-05      1.000   -1.58e+05    1.58e+05
V7             0.9763   2.98e+05   3.28e-06      1.000   -5.83e+05    5.83e+05
V12           -0.6240   5.49e+04  -1.14e-05      1.000   -1.08e+05    1.08e+05
V15            0.7803   2.16e+05   3.61e-06      1.000   -4.24e+05    4.24e+05
V24           -0.7804    4.7e+04  -1.66e-05      1.000    -9.2e+04     9.2e+04
V25            0.0505   2.82e+05   1.79e-07      1.000   -5.53e+05    5.53e+05
V29           -1.9782   8.41e+04  -2.35e-05      1.000   -1.65e+05    1.65e+05
V30            2.5336   8.17e+04    3.1e-05      1.000    -1.6e+05     1.6e+05
V31           -0.0831   7.49e+04  -1.11e-06      1.000   -1.47e+05    1.47e+05
V34            0.5482   1.09e+05   5.03e-06      1.000   -2.14e+05    2.14e+05
==============================================================================

All the nan values have now been replaced with 1.0. I am going to recheck the VIF.

In [ ]:
vif_series = pd.Series(
    [variance_inflation_factor(X_train3.values, i) for i in range(X_train3.shape[1])],
    index=X_train3.columns,
    dtype=float,
)
print("VIF values: \n\n{}\n".format(vif_series))
VIF values: 

V1    1101.467
V2    1669.314
V3         inf
V5         inf
V7         inf
V12        inf
V15        inf
V24        inf
V25        inf
V29        inf
V30        inf
V31        inf
V34        inf
dtype: float64

In [ ]:
X_train4=X_train3.drop(["V30","V31","V34"],axis=1)

LogR4=sm.Logit(y_train_over,X_train4.astype(float))
lg4=LogR4.fit()
Optimization terminated successfully.
         Current function value: 0.333276
         Iterations 7
In [ ]:
print(lg4.summary())
                           Logit Regression Results                           
==============================================================================
Dep. Variable:                 Target   No. Observations:                26446
Model:                          Logit   Df Residuals:                    26436
Method:                           MLE   Df Model:                            9
Date:                Wed, 19 Jul 2023   Pseudo R-squ.:                  0.5192
Time:                        17:06:48   Log-Likelihood:                -8813.8
converged:                       True   LL-Null:                       -18331.
Covariance Type:            nonrobust   LLR p-value:                     0.000
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
V1             0.6155      0.020     30.575      0.000       0.576       0.655
V2             1.3846      0.035     39.812      0.000       1.316       1.453
V3            -0.8026      0.016    -51.343      0.000      -0.833      -0.772
V5            -1.0030      0.019    -52.631      0.000      -1.040      -0.966
V7            -5.9753      0.163    -36.646      0.000      -6.295      -5.656
V12            0.7081      0.027     26.267      0.000       0.655       0.761
V15            3.1218      0.071     43.706      0.000       2.982       3.262
V24           -1.1787      0.033    -35.321      0.000      -1.244      -1.113
V25           -3.4226      0.077    -44.207      0.000      -3.574      -3.271
V29           -1.6468      0.039    -42.606      0.000      -1.723      -1.571
==============================================================================
In [ ]:
vif_series = pd.Series(
    [variance_inflation_factor(X_train4.values, i) for i in range(X_train4.shape[1])],
    index=X_train4.columns,
    dtype=float,
)
print("VIF values: \n\n{}\n".format(vif_series))
VIF values: 

V1     10.528
V2     25.748
V3      6.866
V5      3.072
V7    277.548
V12    24.611
V15   168.591
V24    50.735
V25    54.654
V29    25.789
dtype: float64

After playing around with the model and eliminating one at a time (V5,V7,V12,V15,V24,V25,V29,V30,V31 and V34), I discovered that I got optimal model performance (lowest p values and highest pseudo r) by eliminating (V30,V31 and 34). Now, I no longer have infinate numbers for VIF and I will try to further tune the model by eliminating variables with higest VIFs one at a time. Unfortunately, I still have a pretty low pseudo r.

In [ ]:
X_train5 = X_train4.drop("V7", axis=1)

vif_series = pd.Series(
    [variance_inflation_factor(X_train5.values, i) for i in range(X_train5.shape[1])],
    index=X_train5.columns,
    dtype=float,
)
print("VIF values: \n\n{}\n".format(vif_series))
VIF values: 

V1    3.313
V2    2.275
V3    2.395
V5    3.065
V12   1.379
V15   3.339
V24   2.509
V25   5.992
V29   2.675
dtype: float64

Now, all VIFs are less than 10. I will check my final model for scores and then bring it into production.

In [ ]:
LogR5=sm.Logit(y_train_over,X_train5.astype(float))
lg5=LogR5.fit()
Optimization terminated successfully.
         Current function value: 0.361983
         Iterations 7
In [ ]:
print(lg5.summary())
                           Logit Regression Results                           
==============================================================================
Dep. Variable:                 Target   No. Observations:                26446
Model:                          Logit   Df Residuals:                    26437
Method:                           MLE   Df Model:                            8
Date:                Wed, 19 Jul 2023   Pseudo R-squ.:                  0.4778
Time:                        17:17:22   Log-Likelihood:                -9573.0
converged:                       True   LL-Null:                       -18331.
Covariance Type:            nonrobust   LLR p-value:                     0.000
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
V1            -0.0044      0.010     -0.425      0.671      -0.025       0.016
V2             0.1675      0.009     18.489      0.000       0.150       0.185
V3            -0.3670      0.009    -42.501      0.000      -0.384      -0.350
V5            -1.0031      0.019    -53.791      0.000      -1.040      -0.967
V12           -0.2702      0.007    -41.260      0.000      -0.283      -0.257
V15            0.5885      0.011     51.514      0.000       0.566       0.611
V24            0.0233      0.006      3.827      0.000       0.011       0.035
V25           -0.7971      0.023    -34.495      0.000      -0.842      -0.752
V29           -0.3277      0.012    -27.263      0.000      -0.351      -0.304
==============================================================================

Now, V1 has exceeded 0.05 and pseudo r has dropped slightly. I will conduct one final model tweak to see what happens when I drop V1.

In [ ]:
X_train6=X_train5.drop(["V1"],axis=1)

LogR6=sm.Logit(y_train_over,X_train6.astype(float))
lg6=LogR6.fit()
Optimization terminated successfully.
         Current function value: 0.361987
         Iterations 7
In [ ]:
print(lg6.summary())
                           Logit Regression Results                           
==============================================================================
Dep. Variable:                 Target   No. Observations:                26446
Model:                          Logit   Df Residuals:                    26438
Method:                           MLE   Df Model:                            7
Date:                Wed, 19 Jul 2023   Pseudo R-squ.:                  0.4778
Time:                        17:44:53   Log-Likelihood:                -9573.1
converged:                       True   LL-Null:                       -18331.
Covariance Type:            nonrobust   LLR p-value:                     0.000
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
V2             0.1653      0.007     22.226      0.000       0.151       0.180
V3            -0.3669      0.009    -42.529      0.000      -0.384      -0.350
V5            -0.9989      0.016    -63.406      0.000      -1.030      -0.968
V12           -0.2697      0.006    -41.790      0.000      -0.282      -0.257
V15            0.5880      0.011     51.806      0.000       0.566       0.610
V24            0.0231      0.006      3.809      0.000       0.011       0.035
V25           -0.8005      0.022    -36.863      0.000      -0.843      -0.758
V29           -0.3273      0.012    -27.333      0.000      -0.351      -0.304
==============================================================================

All the p values went above 0.05 when I dropped V1. Therfore, my final model is X_train5.

In [ ]:
LogRO_best2 = LogisticRegression(
    C=0.2,
    class_weight="balanced",
    dual=False,
    fit_intercept=True,
    l1_ratio=1,
    max_iter=100,
    multi_class="auto",
    n_jobs=None,
    random_state=1,
    solver='lbfgs',
    tol=0.0001,
    verbose=0,
    warm_start=False)

# Fit the best algorithm to the data.
LogRO_best2.fit(X_train6, y_train_over)
Out[ ]:
LogisticRegression(C=0.2, class_weight='balanced', l1_ratio=1, random_state=1)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
LogisticRegression(C=0.2, class_weight='balanced', l1_ratio=1, random_state=1)
In [ ]:
X_val2=X_val.drop(["V1","V4","V6","V7","V8","V9","V10","V11","V13","V14","V16","V17","V18",
                   "V19","V20","V21","V22","V23","V26","V27",
                   "V28","V30","V31","V32","V33","V34","V35","V36","V37","V38","V39","V40"],axis=1)
In [ ]:
X_val2=X_train6
In [ ]:
print("Accuracy on train and validation set")
print(accuracy_score(y_train_over, LogRO_best2.predict(X_train6)))
print(accuracy_score(y_val, LogRO_best2.predict(X_val2)))
print("Recall on train and validation set")
print(recall_score(y_train_over, LogRO_best.predict(X_train6)))
print(recall_score(y_val, LogRO_best2.predict(X_val2)))
print("Precision on train and validation set")
print(precision_score(y_train_over, LogRO_best2.predict(X_train6)))
print(precision_score(y_val, LogRO_best2.predict(X_val2)))
print("F1 on train and validation set")
print(f1_score(y_train_over, LogRO_best2.predict(X_train6)))
print(f1_score(y_val, LogRO_best2.predict(X_val2)))
print("")

Note:I keep getting an error message stating that values for X and y don't match. I am going to take the following step just so I can get a final score.

In [ ]:
y_val.shape
Out[ ]:
(6000,)
In [ ]:
y_train_over.shape
Out[ ]:
(26446,)
In [ ]:
y_train2=y_train_over.sample(n=6000,random_state=1)
In [ ]:
X_train6.shape
Out[ ]:
(26446, 8)
In [ ]:
X_train7=X_train6.sample(n=6000,random_state=1)
In [ ]:
X_val2.shape
Out[ ]:
(26446, 8)
In [ ]:
X_val3=X_val2.sample(n=6000,random_state=1)
In [ ]:
print("Accuracy on train and validation set")
print(accuracy_score(y_train2, LogRO_best2.predict(X_train7)))
print(accuracy_score(y_val, LogRO_best2.predict(X_val3)))
print("Recall on train and validation set")
print(recall_score(y_train2, LogRO_best2.predict(X_train7)))
print(recall_score(y_val, LogRO_best2.predict(X_val3)))
print("Precision on train and validation set")
print(precision_score(y_train2, LogRO_best2.predict(X_train7)))
print(precision_score(y_val, LogRO_best2.predict(X_val3)))
print("F1 on train and validation set")
print(f1_score(y_train2, LogRO_best2.predict(X_train7)))
print(f1_score(y_val, LogRO_best2.predict(X_val3)))
print("")
Accuracy on train and validation set
0.8703333333333333
0.5005
Recall on train and validation set
0.8752528658125421
0.5105105105105106
Precision on train and validation set
0.8641810918774967
0.05659121171770972
F1 on train and validation set
0.8696817420435511
0.10188792328438717

Any efforts I have made to improve this model have failed. I will go back to my original model LogRO_best and employ it.

Note:I am not satisfied with the results. I am going start over and run everything again through the simplified dataset so long as a correlational analysis reveals that most of the high correlations have been eliminiated.

In [19]:
wind_simp=wind.copy()
In [20]:
wind_simp=wind_simp.drop(["V1","V4","V6","V7","V8","V9","V10","V11","V13","V14","V16","V17","V18",
                   "V19","V20","V21","V22","V23","V26","V27",
                   "V28","V30","V31","V32","V33","V34","V35","V36","V37","V38","V39","V40"],axis=1)
In [ ]:
wind_simp.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20000 entries, 0 to 19999
Data columns (total 9 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   V2      19982 non-null  float64
 1   V3      20000 non-null  float64
 2   V5      20000 non-null  float64
 3   V12     20000 non-null  float64
 4   V15     20000 non-null  float64
 5   V24     20000 non-null  float64
 6   V25     20000 non-null  float64
 7   V29     20000 non-null  float64
 8   Target  20000 non-null  int64  
dtypes: float64(8), int64(1)
memory usage: 1.4 MB
In [ ]:
plt.figure(figsize=(8,8))
sns.heatmap(data=wind[["V2","V3","V5","V12","V15","V24","V25",
                       "V29","Target"]]
            .corr(),annot=True,cbar=False,cmap="Spectral")
Out[ ]:
<Axes: >

All the correlations of 0.70 or higher are gone.

  1. Preprocessing
In [21]:
X = wind_simp.drop("Target",axis=1)
y = wind_simp.pop("Target")
In [22]:
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=.30, random_state=1,stratify=y)
  1. Missing Value Imputation
In [23]:
# Let's impute the missing values
imp_median = SimpleImputer(missing_values=np.nan, strategy="median")

# fit the imputer on train data and transform the train data
X_train["V2"] = imp_median.fit_transform(X_train[["V2"]])
X_val["V2"] = imp_median.fit_transform(X_val[["V2"]])
  1. Base Models
In [24]:
# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.recall_score)
In [ ]:
models = []  # Empty list to store all the models

# Appending models into the list
models.append(("LogR", LogisticRegression(random_state=1)))
models.append(("DTree", DecisionTreeClassifier(random_state=1)))
models.append(("AdaBoost", AdaBoostClassifier(random_state=1)))
models.append(("Bagging", BaggingClassifier(random_state=1)))
models.append(("RF",RandomForestClassifier(random_state=1)))
models.append(("GB", GradientBoostingClassifier(random_state=1)))


## Complete the code to append remaining 4 models in the list models

results1 = []  # Empty list to store all model's CV scores
names = []  # Empty list to store name of the models


# loop through all models to get the mean cross validated score
print("\n" "Cross-Validation performance on training dataset:" "\n")

for name, model in models:
    kfold = StratifiedKFold(
        n_splits=5, shuffle=True, random_state=1
    )  # Setting number of splits equal to 5
    cv_result = cross_val_score(
        estimator=model, X=X_train, y=y_train, scoring=scorer, cv=kfold
    )
    results1.append(cv_result)
    names.append(name)
    print("{}: {}".format(name, cv_result.mean()))

print("\n" "Validation Performance:" "\n")

for name, model in models:
    model.fit(X_train, y_train)
    scores = recall_score(y_val, model.predict(X_val))
    print("{}: {}".format(name, scores))
Cross-Validation performance on training dataset:

LogR: 0.4027956989247312
DTree: 0.5817617866004963
AdaBoost: 0.34480562448304386
Bagging: 0.5211662531017369
RF: 0.5276757650951198
GB: 0.5005459057071959

Validation Performance:

LogR: 0.4084084084084084
DTree: 0.6186186186186187
AdaBoost: 0.3333333333333333
Bagging: 0.5855855855855856
RF: 0.5675675675675675
GB: 0.5195195195195195

Although the recall scores are not particulary high, there is definately less overfitting. Over or undersampling will reduce imbalance and hopefully increase the recall scores.

In [ ]:
# Plotting boxplots for CV scores of all models defined above
fig = plt.figure(figsize=(10, 7))

fig.suptitle("Algorithm Comparison")
ax = fig.add_subplot(111)

plt.boxplot(results1)
ax.set_xticklabels(names)

plt.show()

There are still outliers that will skew the data. However, LogR and GB, now have medians in near the center of the distribution.

  1. Oversampling
In [ ]:
# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.recall_score)
In [ ]:
print("Before OverSampling, counts of label '1': {}".format(sum(y_train == 1)))
print("Before OverSampling, counts of label '0': {} \n".format(sum(y_train == 0)))

# Synthetic Minority Over Sampling Technique
sm = SMOTE(sampling_strategy=1, k_neighbors=5, random_state=1)
X_train_over, y_train_over = sm.fit_resample(X_train, y_train)


print("After OverSampling, counts of label '1': {}".format(sum(y_train_over == 1)))
print("After OverSampling, counts of label '0': {} \n".format(sum(y_train_over == 0)))


print("After OverSampling, the shape of train_X: {}".format(X_train_over.shape))
print("After OverSampling, the shape of train_y: {} \n".format(y_train_over.shape))
Before OverSampling, counts of label '1': 777
Before OverSampling, counts of label '0': 13223 

After OverSampling, counts of label '1': 13223
After OverSampling, counts of label '0': 13223 

After OverSampling, the shape of train_X: (26446, 8)
After OverSampling, the shape of train_y: (26446,) 

In [ ]:
models = []  # Empty list to store all the models

# Appending models into the list
models.append(("LogRO", LogisticRegression(random_state=1)))
models.append(("BaggingO", BaggingClassifier(random_state=1)))
models.append(("DTreeO", DecisionTreeClassifier(random_state=1)))
models.append(("AdaBoostO", AdaBoostClassifier(random_state=1)))
models.append(("RFO", RandomForestClassifier(random_state=1)))
models.append(("GBO", GradientBoostingClassifier(random_state=1)))
## Complete the code to append remaining 4 models in the list models

results1 = []  # Empty list to store all model's CV scores
names = []  # Empty list to store name of the models


# loop through all models to get the mean cross validated score
print("\n" "Cross-Validation performance on training dataset:" "\n")

for name, model in models:
    kfold = StratifiedKFold(
        n_splits=5, shuffle=True, random_state=1
    )  # Setting number of splits equal to 5
    cv_result = cross_val_score(
        estimator=model, X=X_train_over, y=y_train_over, scoring=scorer, cv=kfold
    )  ## Complete the code to build models on oversampled data
    results1.append(cv_result)
    names.append(name)
    print("{}: {}".format(name, cv_result.mean()))

print("\n" "Validation Performance:" "\n")

for name, model in models:
    model.fit(X_train_over,y_train_over)
    scores = recall_score(y_val, model.predict(X_val))
    print("{}: {}".format(name, scores))
Cross-Validation performance on training dataset:

LogRO: 0.8779405666501748
BaggingO: 0.9691452201939548
DTreeO: 0.9540956161398351
AdaBoostO: 0.8670514400761864
RFO: 0.9769347868984667
GBO: 0.9149970400578834

Validation Performance:

LogRO: 0.8348348348348348
BaggingO: 0.7927927927927928
DTreeO: 0.6996996996996997
AdaBoostO: 0.8318318318318318
RFO: 0.8078078078078078
GBO: 0.8648648648648649
In [ ]:
# Plotting boxplots for CV scores of all models defined above
fig = plt.figure(figsize=(10, 7))

fig.suptitle("Algorithm Comparison")
ax = fig.add_subplot(111)

plt.boxplot(results1)
ax.set_xticklabels(names)

plt.show()

Nothing very impressive yet. Lots of overfitting although the DTreeO's distrution now looks almost symmetirical.

  1. Undersampling
In [25]:
rus = RandomUnderSampler(random_state=1, sampling_strategy=1)
X_train_un, y_train_un = rus.fit_resample(X_train, y_train)


print("Before UnderSampling, counts of label '1': {}".format(sum(y_train == 1)))
print("Before UnderSampling, counts of label '0': {} \n".format(sum(y_train == 0)))


print("After UnderSampling, counts of label '1': {}".format(sum(y_train_un == 1)))
print("After UnderSampling, counts of label '0': {} \n".format(sum(y_train_un == 0)))


print("After UnderSampling, the shape of train_X: {}".format(X_train_un.shape))
print("After UnderSampling, the shape of train_y: {} \n".format(y_train_un.shape))
Before UnderSampling, counts of label '1': 777
Before UnderSampling, counts of label '0': 13223 

After UnderSampling, counts of label '1': 777
After UnderSampling, counts of label '0': 777 

After UnderSampling, the shape of train_X: (1554, 8)
After UnderSampling, the shape of train_y: (1554,) 

In [26]:
models = []  # Empty list to store all the models

# Appending models into the list
models.append(("LogRU", LogisticRegression(random_state=1)))
models.append(("BaggingU", BaggingClassifier(random_state=1)))
models.append(("DTreeU", DecisionTreeClassifier(random_state=1)))
models.append(("AdaBoostU", AdaBoostClassifier(random_state=1)))
models.append(("RFU", RandomForestClassifier(random_state=1)))
models.append(("GBU", GradientBoostingClassifier(random_state=1)))
## Complete the code to append remaining 4 models in the list models

results1 = []  # Empty list to store all model's CV scores
names = []  # Empty list to store name of the models


# loop through all models to get the mean cross validated score
print("\n" "Cross-Validation performance on training dataset:" "\n")

for name, model in models:
    kfold = StratifiedKFold(
        n_splits=5, shuffle=True, random_state=1
    )  # Setting number of splits equal to 5
    cv_result = cross_val_score(
        estimator=model, X=X_train_un, y=y_train_un, scoring=scorer, cv=kfold
    )  ## Complete the code to build models on oversampled data
    results1.append(cv_result)
    names.append(name)
    print("{}: {}".format(name, cv_result.mean()))

print("\n" "Validation Performance:" "\n")

for name, model in models:
    model.fit(X_train_un,y_train_un)## Complete the code to build models on oversampled data
    scores = recall_score(y_val, model.predict(X_val))
    print("{}: {}".format(name, scores))
Cross-Validation performance on training dataset:

LogRU: 0.8648800661703888
BaggingU: 0.8507526881720431
DTreeU: 0.8120430107526883
AdaBoostU: 0.8223986765922249
RFU: 0.8893382961124896
GBU: 0.875144747725393

Validation Performance:

LogRU: 0.8468468468468469
BaggingU: 0.8408408408408409
DTreeU: 0.7927927927927928
AdaBoostU: 0.8468468468468469
RFU: 0.8708708708708709
GBU: 0.8738738738738738

Much higher and closer Recall scores on the undersampled simplified data. There GBU actually fits very closely with train/val scores of .878 and .874. Those are the closest I have seen so far. BaggingU scores are a close second with .851/.841. RFU comese in third at .85/.841.

In [ ]:
# Plotting boxplots for CV scores of all models defined above
fig = plt.figure(figsize=(10, 7))

fig.suptitle("Algorithm Comparison")
ax = fig.add_subplot(111)

plt.boxplot(results1)
ax.set_xticklabels(names)

plt.show()
  1. Model comparisons
In [ ]:
comparison_frame1 = pd.DataFrame({'Base Model':['LogR','DTree','AdaBoost','Bagging',
                                          'RF','GB'],
                                  'Train_Recall':[0.40,0.58,0.34,0.52,0.52,0.50],
                                  'Val_Recall':[0.40,0.61,0.33,0.58,0.56,0.51]})
comparison_frame2=pd.DataFrame({'Oversample':['LogRO','BaggingO','DTreeO','AdaBoostO','RFO','GBO'],
                                 'Train Recall':[0.87,0.96,0.95,0.86,0.97,0.91],
                                 'Val_Recall':[0.83,0.79,0.70,0.83,0.81,0.86],})
comparison_frame3=pd.DataFrame({'Undersample':['LogRU','DTreeU','AdaBoostU','BaggingU','RFU','GBU'],
                                 'Train_Recall':[0.864,0.850,0.812,0.822,0.889,0.873],
                                 'Test_Recall':[0.846,0.840,0.792,0.846,0.870,0.875]})
In [ ]:
comparison_frame1
Out[ ]:
Base Model Train_Recall Val_Recall
0 LogR 0.400 0.400
1 DTree 0.580 0.610
2 AdaBoost 0.340 0.330
3 Bagging 0.520 0.580
4 RF 0.520 0.560
5 GB 0.500 0.510
In [ ]:
comparison_frame2
Out[ ]:
Oversample Train Recall Val_Recall
0 LogRO 0.870 0.830
1 BaggingO 0.960 0.790
2 DTreeO 0.950 0.700
3 AdaBoostO 0.860 0.830
4 RFO 0.970 0.810
5 GBO 0.910 0.860
In [ ]:
comparison_frame3
Out[ ]:
Undersample Train_Recall Test_Recall
0 LogRU 0.864 0.846
1 DTreeU 0.850 0.840
2 AdaBoostU 0.812 0.792
3 BaggingU 0.822 0.846
4 RFU 0.889 0.870
5 GBU 0.873 0.875

Best models are LogRU (.864/.846), RFU(.889/.870), GBU(.873/.875).

I really want to treat this dataset for outliers. However, we have been told so many times not to eliminate genuine data especially when it represents continuous values, so I haven't done it.

  1. Tune and score the three best models.

A. LogRU (Logistic Regression with undersampling)

In [ ]:
# defining model
LogRU_tuned = LogisticRegression(random_state=1)

# Parameter grid to pass in RandomSearchCV
param_grid = {'C':np.arange(0.1,1.1,0.1)}

#Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(estimator=LogRU_tuned, param_distributions=param_grid,
                                   n_iter=10, n_jobs = -1, verbose=2,
                                   scoring=scorer, cv=5, random_state=1)

#Fitting parameters in RandomizedSearchCV
randomized_cv.fit(X_train_un,y_train_un)

print("Best parameters are {} with CV score={}:" .format(randomized_cv.best_params_,randomized_cv.best_score_))
Fitting 5 folds for each of 10 candidates, totalling 50 fits
Best parameters are {'C': 0.1} with CV score=0.8674689826302731:
In [ ]:
# Set the clf to the best combination of parameters
LogRU_best = LogisticRegression(
    C=0.1,
    class_weight="balanced",
    dual=False,
    fit_intercept=True,
    l1_ratio=1,
    max_iter=100,
    multi_class="auto",
    n_jobs=-1,
    random_state=1,
    solver='lbfgs',
    tol=0.0001,
    verbose=2,
    warm_start=True)

# Fit the best algorithm to the data.
LogRU_best.fit(X_train_over, y_train_over)
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.
Out[ ]:
LogisticRegression(C=0.1, class_weight='balanced', l1_ratio=1, n_jobs=-1,
                   random_state=1, verbose=2, warm_start=True)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
LogisticRegression(C=0.1, class_weight='balanced', l1_ratio=1, n_jobs=-1,
                   random_state=1, verbose=2, warm_start=True)
In [ ]:
print("Accuracy on train and validation set")
print(accuracy_score(y_train_un, LogRU_best.predict(X_train_un)))
print(accuracy_score(y_val, LogRU_best.predict(X_val)))
print("Recall on train and validation set")
print(recall_score(y_train_un, LogRU_best.predict(X_train_un)))
print(recall_score(y_val, LogRU_best.predict(X_val)))
print("Precision on train and validation set")
print(precision_score(y_train_un, LogRU_best.predict(X_train_un)))
print(precision_score(y_val, LogRU_best.predict(X_val)))
print("F1 on train and validation set")
print(f1_score(y_train_un, LogRU_best.predict(X_train_un)))
print(f1_score(y_val, LogRU_best.predict(X_val)))
print("")
Accuracy on train and validation set
0.8552123552123552
0.8475
Recall on train and validation set
0.8481338481338482
0.8348348348348348
Precision on train and validation set
0.860313315926893
0.24428822495606328
F1 on train and validation set
0.8541801685029166
0.37797416723317473

Good accuarcy (0.855/0.847) and recall(0.848/0.834). Good scores and little indication of overfiting. Not so good on precision (0.860/0.244) which looks at the balance between TP and FP. Since the precision score is low, it makes sense that the F1 score (0.854/0.377) is also low.

B. RFU (Random Forest with undersampling)

In [ ]:
# defining model
RFU_tuned = RandomForestClassifier(random_state=1)

# Parameter grid to pass in RandomSearchCV
param_grid = {"n_estimators":[100.150,200],
              "min_samples_leaf":np.arange(1,11,1),
              "max_features":[np.arange(0.10,0.80,0.1),'sqrt'],
              "max_samples":np.arange(0.2,0.9,0.10)}

#Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(estimator=RFU_tuned, param_distributions=param_grid,
                                   n_iter=10, n_jobs = -1, verbose=2,
                                   scoring=scorer, cv=5, random_state=1)

#Fitting parameters in RandomizedSearchCV
randomized_cv.fit(X_train_un,y_train_un)

print("Best parameters are {} with CV score={}:" .format(randomized_cv.best_params_,randomized_cv.best_score_))
Fitting 5 folds for each of 10 candidates, totalling 50 fits
Best parameters are {'n_estimators': 200, 'min_samples_leaf': 5, 'max_samples': 0.6000000000000001, 'max_features': 'sqrt'} with CV score=0.894449958643507:
In [ ]:
# Set the clf to the best combination of parameters
RFU_best = RandomForestClassifier(
    min_samples_leaf=5,
    max_samples=0.60,
    max_features='sqrt',
    bootstrap=True,)

# Fit the best algorithm to the data.
RFU_best.fit(X_train_un, y_train_un)
Out[ ]:
RandomForestClassifier(max_samples=0.6, min_samples_leaf=5)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
RandomForestClassifier(max_samples=0.6, min_samples_leaf=5)
In [ ]:
print("Accuracy on train and validation set")
print(accuracy_score(y_train_un, RFU_best.predict(X_train_un)))
print(accuracy_score(y_val, RFU_best.predict(X_val)))
print("Recall on train and validation set")
print(recall_score(y_train_un, RFU_best.predict(X_train_un)))
print(recall_score(y_val, RFU_best.predict(X_val)))
print("Precision on train and validation set")
print(precision_score(y_train_un, RFU_best.predict(X_train_un)))
print(precision_score(y_val, RFU_best.predict(X_val)))
print("F1 on train and validation set")
print(f1_score(y_train_un, RFU_best.predict(X_train_un)))
print(f1_score(y_val, RFU_best.predict(X_val)))
print("")
Accuracy on train and validation set
0.9253539253539254
0.853
Recall on train and validation set
0.9214929214929215
0.8678678678678678
Precision on train and validation set
0.9286640726329443
0.25643300798580304
F1 on train and validation set
0.9250645994832041
0.3958904109589042

Accuracy and recall scores not as close as LogRU, but alright. Accuracy (0.925/0.853). Recall score(0.921/0.867) indicatiing some overfitting0. Still low scores with significant overfitting for both precision (0.92/0.21) and F1 (0.92/0.39).

C. GBU (Gradient Boosting with undersampling)

In [27]:
# defining model
GBU_tuned = GradientBoostingClassifier(random_state=1)

# Parameter grid to pass in RandomSearchCV
param_grid={"n_estimators":np.arange(100,150,25),
            "learning_rate":[0.2,0.05,1.0],
             }


#Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(estimator=GBU_tuned, param_distributions=param_grid,
                                   n_iter=10, n_jobs = -1, verbose=2,
                                   scoring=scorer, cv=5, random_state=1)

#Fitting parameters in RandomizedSearchCV
randomized_cv.fit(X_train_un,y_train_un)

print("Best parameters are {} with CV score={}:" .format(randomized_cv.best_params_,randomized_cv.best_score_))
Fitting 5 folds for each of 6 candidates, totalling 30 fits
Best parameters are {'n_estimators': 125, 'learning_rate': 0.2} with CV score=0.8880314309346569:
In [28]:
# Set the clf to the best combination of parameters
GBU_best = GradientBoostingClassifier ( n_estimators=125,
    learning_rate=0.2,
    )

# Fit the best algorithm to the data.
GBU_best.fit(X_train_un, y_train_un)
Out[28]:
GradientBoostingClassifier(learning_rate=0.2, n_estimators=125)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
GradientBoostingClassifier(learning_rate=0.2, n_estimators=125)
In [ ]:
print("Accuracy on train and validation set")
print(accuracy_score(y_train_un, GBU_best.predict(X_train_un)))
print(accuracy_score(y_val, GBU_best.predict(X_val)))
print("Recall on train and validation set")
print(recall_score(y_train_un, GBU_best.predict(X_train_un)))
print(recall_score(y_val, GBU_best.predict(X_val)))
print("Precision on train and validation set")
print(precision_score(y_train_un, GBU_best.predict(X_train_un)))
print(precision_score(y_val, GBU_best.predict(X_val)))
print("F1 on train and validation set")
print(f1_score(y_train_un, GBU_best.predict(X_train_un)))
print(f1_score(y_val, GBU_best.predict(X_val)))
print("")
Accuracy on train and validation set
0.9851994851994852
0.8978333333333334
Recall on train and validation set
0.9716859716859717
0.8708708708708709
Precision on train and validation set
0.9986772486772487
0.3372093023255814
F1 on train and validation set
0.984996738421396
0.4861693210393964

The best tuned model is LogRU with an train/val accuracy score of 0.855/0.847, a recall score of 0.848/0.834, a precision score of 0.860/0.244 and an F1 score of 0.854/0.377. As stated earlier, the precision score compares TP to FP and F1 compares Precision and Recall or the balance between FP and FN or Type I and Type II errors.

  1. Final Model Feature importances
In [ ]:
print (pd.DataFrame(GBU_best.feature_importances_, columns = ["Imp"], index = X_train_un.columns).sort_values(by = 'Imp', ascending = False))
      Imp
V15 0.298
V3  0.211
V5  0.125
V25 0.114
V12 0.099
V29 0.066
V2  0.050
V24 0.038

Source: Easy Visa Project Learner Notebook Full Code

The most important features in this model are V15(30%), V3(21%), V5(13%), and V25(11%). That explains almost 75% of the the phenomenon.

Test set final performance¶

  1. Import the dataset.
In [3]:
test=pd.read_csv('/content/drive/MyDrive/Test.csv.csv')
  1. Make a copy.
In [4]:
wind_test=wind.copy()
  1. Drop extraneous variables.
In [5]:
wind_test=wind_test.drop(["V1","V4","V6","V7","V8","V9","V10","V11","V13","V14","V16","V17","V18",
                   "V19","V20","V21","V22","V23","V26","V27",
                   "V28","V30","V31","V32","V33","V34","V35","V36","V37","V38","V39","V40"],axis=1)
  1. Brief EDA
In [ ]:
wind_test.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20000 entries, 0 to 19999
Data columns (total 9 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   V2      19982 non-null  float64
 1   V3      20000 non-null  float64
 2   V5      20000 non-null  float64
 3   V12     20000 non-null  float64
 4   V15     20000 non-null  float64
 5   V24     20000 non-null  float64
 6   V25     20000 non-null  float64
 7   V29     20000 non-null  float64
 8   Target  20000 non-null  int64  
dtypes: float64(8), int64(1)
memory usage: 1.4 MB
In [ ]:
wind_test.head()
Out[ ]:
V2 V3 V5 V12 V15 V24 V25 V29 Target
0 -4.679 3.102 -0.221 0.736 -3.376 3.133 0.665 -3.982 0
1 3.653 0.910 0.332 -0.951 0.193 1.766 -0.267 0.783 0
2 -5.824 0.634 -1.774 1.107 -3.164 1.680 -0.451 -2.034 0
3 1.888 7.046 0.083 0.460 -0.454 -1.818 2.124 -3.963 0
4 3.872 -3.758 3.793 4.724 -2.633 4.490 -3.945 5.107 0
In [ ]:
wind_test.describe()
Out[ ]:
V2 V3 V5 V12 V15 V24 V25 V29 Target
count 19982.000 20000.000 20000.000 20000.000 20000.000 20000.000 20000.000 20000.000 20000.000
mean 0.440 2.485 -0.054 1.605 -2.415 1.134 -0.002 -0.986 0.056
std 3.151 3.389 2.105 2.930 3.355 3.912 2.017 2.684 0.229
min -12.320 -10.708 -8.603 -12.948 -16.417 -16.387 -8.228 -12.579 0.000
25% -1.641 0.207 -1.536 -0.397 -4.415 -1.468 -1.365 -2.787 0.000
50% 0.472 2.256 -0.102 1.508 -2.383 0.969 0.025 -1.176 0.000
75% 2.544 4.566 1.340 3.571 -0.359 3.546 1.397 0.630 0.000
max 13.089 17.091 8.134 15.081 12.246 17.163 8.223 10.722 1.000

Observation:V2 still contains missing values, so they must be imputed.

  1. Build the pipline.
In [11]:
# pipeline takes a list of tuples as parameter. The last entry is the call to the modeling algorithm
pipeline = Pipeline([
    ('scaler',StandardScaler()), 'under_sample',RandomUnderSampler(random_state=1,sampling_strategy=1),
    ('gr', GradientBoostingClassifier(learning_rate=0.2,n_estimators=125))
])

Source: Hands-on Notebook Pipeline and Make Pipeline

Note:This is the first pipeline I built. It includes all steps, but gave me an error message.

In [12]:
pipeline2 = Pipeline([
    ('scaler',StandardScaler()),
    ('bestgb', GradientBoostingClassifier(learning_rate=0.2,n_estimators=125))
])

source: Hands-on Pipline and Make Pipeline

  1. Split the data.
In [6]:
X = wind_test.drop("Target",axis=1)
y = wind_test.pop("Target")
  1. Impute the missing values.
In [7]:
# Let's impute the missing values
imp_median = SimpleImputer(missing_values=np.nan, strategy="median")

# fit the imputer on train data and transform the train data
X["V2"] = imp_median.fit_transform(X[["V2"]])

6.Fit the pipline.

In [13]:
pipeline2.fit(X,y)
Out[13]:
Pipeline(steps=[('scaler', StandardScaler()),
                ('bestgb',
                 GradientBoostingClassifier(learning_rate=0.2,
                                            n_estimators=125))])
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
Pipeline(steps=[('scaler', StandardScaler()),
                ('bestgb',
                 GradientBoostingClassifier(learning_rate=0.2,
                                            n_estimators=125))])
StandardScaler()
GradientBoostingClassifier(learning_rate=0.2, n_estimators=125)

Source: Hands_on Notebook Pipeline and Make Pipeline

In [14]:
pipeline2.score(X,y)
Out[14]:
0.9867
In [ ]:
pipeline2.score(X_train_un,y_train_un)
Out[ ]:
0.888030888030888

Source: Hands-on Notebook Pipeline and Make Pipeline

Note:I ran the score a second time because the first did not include the undersampling step, which was importante in fitting the model.

  1. Run the test data through the chosen model GBU (Gradient Boosting Classifier with Undersampling).
In [16]:
rus = RandomUnderSampler(random_state=1, sampling_strategy=1)
X_test_un, y_test_un = rus.fit_resample(X, y)


print("Before UnderSampling, counts of label '1': {}".format(sum(y == 1)))
print("Before UnderSampling, counts of label '0': {} \n".format(sum(y == 0)))


print("After UnderSampling, counts of label '1': {}".format(sum(y_test_un == 1)))
print("After UnderSampling, counts of label '0': {} \n".format(sum(y_test_un == 0)))


print("After UnderSampling, the shape of train_X: {}".format(X_test_un.shape))
print("After UnderSampling, the shape of train_y: {} \n".format(y_test_un.shape))
Before UnderSampling, counts of label '1': 1110
Before UnderSampling, counts of label '0': 18890 

After UnderSampling, counts of label '1': 1110
After UnderSampling, counts of label '0': 1110 

After UnderSampling, the shape of train_X: (2220, 8)
After UnderSampling, the shape of train_y: (2220,) 

In [17]:
GBU_best_test = GradientBoostingClassifier ( n_estimators=125,
    learning_rate=0.2,
    )

# Fit the best algorithm to the data.
GBU_best_test.fit(X_test_un, y_test_un)
Out[17]:
GradientBoostingClassifier(learning_rate=0.2, n_estimators=125)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
GradientBoostingClassifier(learning_rate=0.2, n_estimators=125)
  1. Compare scores
In [29]:
print("Accuracy on train and test set")
print(accuracy_score(y_train_un, GBU_best.predict(X_train_un)))
print(accuracy_score(y_test_un, GBU_best_test.predict(X_test_un)))
print("Recall on train and test set")
print(recall_score(y_train_un, GBU_best.predict(X_train_un)))
print(recall_score(y_test_un, GBU_best_test.predict(X_test_un)))
print("Precision on train and test set")
print(precision_score(y_train_un, GBU_best.predict(X_train_un)))
print(precision_score(y_test_un, GBU_best_test.predict(X_test_un)))
print("F1 on train and test set")
print(f1_score(y_train_un, GBU_best.predict(X_train_un)))
print(f1_score(y_test_un, GBU_best_test.predict(X_test_un)))
print("")
Accuracy on train and test set
0.9851994851994852
0.9707207207207207
Recall on train and test set
0.9716859716859717
0.9585585585585585
Precision on train and test set
0.9986772486772487
0.9824561403508771
F1 on train and test set
0.984996738421396
0.9703602371181029

In [31]:
## Function to create confusion matrix
def make_confusion_matrix(model,y_actual,labels=[1, 0]):
    '''
    model : classifier to predict values of X
    y_actual : ground truth

    '''
    y_predict = model.predict(X_test_un)
    cm=metrics.confusion_matrix( y_actual, y_predict, labels=[0, 1])
    df_cm = pd.DataFrame(cm, index = [i for i in ["Actual - No","Actual - Yes"]],
                  columns = [i for i in ['Predicted - No','Predicted - Yes']])
    group_counts = ["{0:0.0f}".format(value) for value in
                cm.flatten()]
    group_percentages = ["{0:.2%}".format(value) for value in
                         cm.flatten()/np.sum(cm)]
    labels = [f"{v1}\n{v2}" for v1, v2 in
              zip(group_counts,group_percentages)]
    labels = np.asarray(labels).reshape(2,2)
    plt.figure(figsize = (5,5))
    sns.heatmap(df_cm, annot=labels,cbar=False,cmap="Spectral",fmt='')
    plt.ylabel('True label')
    plt.xlabel('Predicted label')
In [33]:
make_confusion_matrix(GBU_best_test,y_test_un)

Source: Project_SLC_InnHotels_Project_FullCode

In [34]:
feature_names = list(X_train_un.columns)
importances = GBU_best_test.feature_importances_
indices = np.argsort(importances)

plt.figure(figsize=(8, 8))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], color="green", align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()

Source: Project_SCL_DBSA_InnHotels_FullCode

In [30]:
print (pd.DataFrame(GBU_best_test.feature_importances_, columns = ["Imp"], index = X_test_un.columns).sort_values(by = 'Imp', ascending = False))
      Imp
V15 0.316
V3  0.197
V5  0.126
V12 0.112
V25 0.078
V29 0.065
V2  0.053
V24 0.052

Observations: Still slight overfitting on the training vs test data. However, the results are much better than expected.

  1. train/test accuracy (Total correct predictions vs total predictions): 0.93/0.97
  2. train/test recall (Type II errors or FN): 0.96/0.96
  3. train/test precision (Type I errors or FP): 0.90/0.98
  4. train/test F1 score (Balance between precision and recall): 0.93/0.97
  5. Feature importances: V15(32%), V3(20%), V5(13%), V12(11%), V25(8%), V2(5%), V24 (5%). Total expalaination: 94%.

Conclusions: This is an assignment for a course and not a genuine data analysis. I could continue with data engineering and drop non continuous outliers that lie more than 3 standard deviations above the median. I did not do that because if I continue data engineering, I will run out of time and be unable to complete the assignment by the deadline. I also checked the low-code version of this notebook and this was not one of the steps. If I did drop non-continuous outliers, I would be eliminating data points, but I believe this would also make the model more generalizable.

Business Insights and Conclusions¶

Please see presentation. Thank you.